- hard disk needs to have more than 300 GB of free space
- the raw CSV files take up 72 GB
- the converted HDF5 files take up 210 GB (NaNs are stored as 64-bit floats, plus metadata for the column variable types)
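Before downloading, the free-space requirement can be checked programmatically. A minimal sketch (the 300 GB figure comes from the list above; the helper name is ours, not part of the repo):

```python
import shutil

def check_free_space(path=".", required_gb=300):
    """Return (free_gb, ok): free space on the filesystem holding `path`
    and whether it meets the requirement."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb, free_gb >= required_gb

free_gb, ok = check_free_space(".")
print(f"Free space: {free_gb:.1f} GB ({'OK' if ok else 'insufficient'})")
```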
This is a very popular dataset that has appeared in numerous ML examples.
This is a non-trivial benchmark (72 GB of data in CSV format) to compare the predictive performance, timing, memory usage and scalability
- between various machine learning & deep learning algorithms
- between the same algorithm implemented in different frameworks and languages
We understand similar examples have been performed before, such as 1, 2 and 3; the library developments for dask-learn are now deprecated and replaced by 4. Another related Dask example is 5. Other examples using different Hadoop-related software stacks can be found at 6, 7, 8 and 9.
We do not claim to have the original idea, but wish to provide a completely reproducible, realistic example to showcase the best practices for benchmarking ML code and to compare various parallelism approaches. The main goal is to provide an assessment of the amount of compute resources needed to process a certain amount of data. The benchmark is also not meant to be completely comprehensive due to the large possible configuration space; we pick a few algorithms that are popular and that the author is familiar with. We encourage others to contribute their own implementations in a reproducible way for comparison.
We hope to illustrate:
- realistic data preprocessing steps
- a few best practices for unit testing and benchmarking machine learning code in Python
- best practices to deploy a reproducible ML software stack on different machines
- how to set up Intel-optimized ML libraries for the best performance, or help push for more performance
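As a small illustration of the benchmarking practices we have in mind, here is a minimal timing and peak-memory harness. The helper name is ours, and note that `tracemalloc` only sees Python-level allocations, not buffers allocated inside native libraries:

```python
import time
import tracemalloc

def benchmark(fn, *args, n_repeats=3, **kwargs):
    """Run fn several times; report the best wall time (a common practice,
    since the minimum is least polluted by system noise) and the peak
    Python-level memory use."""
    best = float("inf")
    tracemalloc.start()
    for _ in range(n_repeats):
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - t0)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, best, peak

# Toy workload standing in for a training run
result, secs, peak_bytes = benchmark(sum, range(1_000_000))
```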
To achieve those goals, we perform the following tasks:
- classification of whether a flight is delayed or not
  - Boosting from
    - XGBoost
    - PyDAAL
    - Microsoft's LightGBM
  - Random Forest from
    - Scikit-Learn
    - Spark ML
  - SVM with non-linear kernel (linear kernel is almost the same as Logistic regression) from
    - Scikit-Learn
    - Spark ML
  - Logistic regression (for baseline comparison) from
    - Scikit-Learn
    - Spark ML
  - an appropriate neural network topology, possibly a deep feed-forward network
- regression of how long a flight is delayed for (within the delayed population)
  - Boosting from
    - XGBoost
  - Random Forest from
    - Scikit-Learn
    - Spark ML
  - Linear regression from Scikit-learn for baseline comparison
  - an appropriate neural network topology, possibly a deep feed-forward network
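Both prediction targets above derive from the arrival-delay column. A minimal sketch, assuming the column is named `ArrDelay` (as in the airline on-time CSVs) and using a 15-minute threshold for "delayed"; both the column name and the threshold are assumptions, not fixed by this repo:

```python
import pandas as pd

def make_targets(df, threshold=15):
    """Build the classification label (delayed or not) and the regression
    target (delay minutes, within the delayed population) from ArrDelay."""
    delay = pd.to_numeric(df["ArrDelay"], errors="coerce")
    is_delayed = (delay >= threshold).astype(int)  # classification label
    delay_minutes = delay[delay >= threshold]      # regression target
    return is_delayed, delay_minutes

# Toy example; missing values count as not-delayed here
df = pd.DataFrame({"ArrDelay": [-5, 0, 20, 90, None]})
y_cls, y_reg = make_targets(df)
```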
We will also explore whether any natural clusters (subpopulations) are present in the dataset.
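For the cluster exploration, one scalable option is scikit-learn's `MiniBatchKMeans`, whose `partial_fit` can stream over chunks too large to hold in memory at once. A sketch on synthetic data standing in for the real feature matrix (the chunking scheme and `n_clusters=2` are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic subpopulations stand in for the real features.
chunk_a = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
chunk_b = rng.normal(loc=10.0, scale=1.0, size=(500, 3))

X = np.vstack([chunk_a, chunk_b])
rng.shuffle(X)  # mix the rows, as a shuffled on-disk dataset would be

km = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3)
# partial_fit processes one chunk at a time instead of the full matrix.
for chunk in np.array_split(X, 4):
    km.partial_fit(chunk)

labels_a = km.predict(chunk_a)
labels_b = km.predict(chunk_b)
```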
- one node of server class Xeon (Haswell / Broadwell), and / or
- Xeon Phi (KNL)
```
.
|
+---_config    : contains set up script for Intel Python
+---_src       : contains Python scripts for processing the data
|   +---read_csv_and_print_stat.py (runs within 2 mins)
+---_results   : benchmark results
+---_doc       : other documentation
+---_data
+---_benchmark : code for benchmarking
```
Go to `config` and source `install_py35_env.sh` to install Intel Python into your home directory.
Other possible dependencies
We plan to make the list of software that we use available as
- a Conda environment yaml list within `config`
- a Dockerfile / Docker image for the best setup instructions
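As a placeholder until that list lands, a hypothetical `environment.yml` sketch; every package name and channel below is an assumption, not the repo's actual dependency list:

```yaml
# Hypothetical sketch only -- package names and the Intel channel are guesses.
name: airline-benchmark
channels:
  - intel
  - conda-forge
dependencies:
  - python=3.5
  - numpy
  - pandas
  - scikit-learn
  - xgboost
```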
Email Karen for suggestions for improving the setups. Thanks.
Processing 2003.csv.bz2 gives a warning message during ETL:

```
sys:1: DtypeWarning: Columns (22) have mixed types. Specify dtype option on
import or set low_memory=False.
```
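The warning can be avoided as the message suggests. A self-contained sketch on a toy CSV with a mixed-type column; the real fix would name the actual offending column of the airline data:

```python
import io
import pandas as pd

# A tiny stand-in for the airline CSV: column "b" mixes numbers and strings,
# which is what triggers DtypeWarning on large chunked reads.
csv = io.StringIO("a,b\n1,2\n3,7\n5,x\n")

# Option 1 (the warning's suggestion): read in one pass, no chunked inference.
df = pd.read_csv(csv, low_memory=False)

# Option 2: declare the dtype up front so type inference never runs.
csv.seek(0)
df2 = pd.read_csv(csv, dtype={"b": str})
```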
```
$ source ./config/unload_ipy35_env.sh
```
This restores the PATH variable to the original state.
Thanks to Duncan Temple Lang, who first showed me the dataset. Please also see his book XML and Web Technologies for Data Science with R for another use of a dataset from the same source.