AirlineDelayBenchmark

A Python code for benchmarking the reading and processing of the airline delay CSV data



System requirements

  • the hard disk needs more than 300 GB of free space
    • the raw CSV files take up 72 GB
    • the converted HDF5 files take up 210 GB (NaNs are stored as 64-bit floats, and additional metadata records the column variable types)
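
As a minimal illustration (not code from this repo) of why NaNs inflate the HDF5 size: pandas upcasts any integer column that contains a NaN to 64-bit floats, so every missing value costs 8 bytes.

```python
import numpy as np
import pandas as pd

# An all-integer column stays a compact integer dtype.
ints = pd.Series([1, 2, 3])
print(ints.dtype)       # int64

# A single NaN forces the whole column to 64-bit float,
# which is one reason the 72 GB of CSVs become 210 GB of HDF5.
with_nan = pd.Series([1, 2, np.nan])
print(with_nan.dtype)   # float64
```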

Data source of the airline delay data

This is a very popular dataset that has been used in numerous ML examples.

Goals

This is a non-trivial benchmark (72 GB of data in CSV format) to compare the predictive performance, timing, memory usage, and scalability

  • between various machine learning & deep learning algorithms
  • between the same algorithm implemented in different frameworks and languages

We understand similar examples have been performed before, such as [1], [2], and [3]; the library developments for dask-learn are now deprecated and replaced by [4]. Another related Dask example is [5]. Other examples using different Hadoop-related software stacks can be found at [6], [7], [8], and [9].

We do not claim to have the original idea, but wish to provide a completely reproducible, realistic example that showcases best practices for benchmarking ML code and compares various parallelism approaches. The main goal is to provide an assessment of the amount of compute resources needed to process a given amount of data. The benchmark is also not meant to be completely comprehensive, due to the large space of possible configurations; we pick a few algorithms that are popular and that the author is familiar with. We encourage others to contribute their own implementations in a reproducible way for comparison.

We hope to illustrate:

  1. realistic data preprocessing steps
  2. a few best practices for unit testing and benchmarking machine learning code in Python
  3. best practices for deploying a reproducible ML software stack on different machines
  4. how to set up Intel-optimized ML libraries for the best performance, or to help push for more performance
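
For item 2, here is a minimal sketch (our assumption about the shape of such a harness, not code from this repo) of the kind of per-step timing and memory measurement the benchmark aims for:

```python
import time
import tracemalloc

def benchmark(fn, *args):
    """Run fn(*args), returning (result, wall seconds, peak heap bytes)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) bytes
    tracemalloc.stop()
    return result, elapsed, peak

# Usage on a toy workload; real runs would wrap ETL or model fitting.
res, secs, peak_bytes = benchmark(sum, range(1_000_000))
print(res, secs, peak_bytes)
```

Note that `tracemalloc` only tracks Python-heap allocations; native allocations by NumPy or the ML libraries would need an external tool.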

To achieve those goals, we perform the following tasks:

  1. classification of whether a flight is delayed or not

      • Boosting from
        • XGBoost
        • PyDAAL
        • Microsoft's LightGBM
      • Random Forest from
        • Scikit-Learn
        • Spark ML
      • SVM with a non-linear kernel (a linear kernel is nearly equivalent to logistic regression) from
        • Scikit-Learn
        • Spark ML
      • Logistic regression for baseline comparison from
        • Scikit-Learn
        • Spark ML
      • an appropriate neural network topology, possibly a deep feed-forward network
  2. regression of how long a flight is delayed for (within the delayed population)

      • Boosting from XGBoost
      • Random Forest
        • Scikit-Learn
        • Spark ML
      • Linear regression from Scikit-learn for baseline comparison
      • an appropriate neural network topology, possibly a deep feed-forward network

We will also explore whether any natural clusters (subpopulations) are present in the dataset.
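
As a hedged sketch of how the two supervised targets above could be derived: the `ArrDelay` column name follows the public airline on-time schema, but the 15-minute delay threshold is our assumption, not this repo's definition.

```python
import pandas as pd

# Toy stand-in for a few rows of the airline data.
df = pd.DataFrame({"ArrDelay": [-3.0, 0.0, 22.0, 95.0]})

# Classification target: was the flight delayed?
# (15-minute cutoff is an assumption, not the repo's definition.)
df["is_delayed"] = (df["ArrDelay"] > 15).astype(int)

# Regression population: how long the delayed flights were delayed.
delayed = df[df["is_delayed"] == 1]

print(df["is_delayed"].tolist())   # [0, 0, 1, 1]
```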

Target hardware

  • one node of a server-class Xeon (Haswell / Broadwell), and / or
  • Xeon Phi (KNL)

Files

.
|
+---_config    : contains the setup script for Intel Python
+---_src       : contains Python scripts for processing the data
|   +---read_csv_and_print_stat.py : runs within 2 mins
+---_results   : benchmark results
+---_doc       : other documentation
+---_data
+---_benchmark : code for benchmarking

IO references

One-time software setup

Go to config and source install_py35_env.sh to install Intel Python into your home directory. Other dependencies may also be needed.

We plan to make the list of software that we use available as

  • a Conda environment yaml list within config
  • a Dockerfile / Docker image for the best setup instructions

Subsequent use of code after the one-time software setup

Email Karen with suggestions for improving the setup. Thanks.

Notes about data

Processing 2003.csv.bz2 gives a warning message during ETL.

sys:1: DtypeWarning: Columns (22) have mixed types. Specify dtype option on
import or set low_memory=False.
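
One way to silence this warning, sketched on a tiny in-memory stand-in for the file (the column name `Col22` is hypothetical; on the real file you would pass the actual name or index of column 22):

```python
import io
import pandas as pd

# Tiny stand-in for 2003.csv.bz2: a column that mixes numbers and
# strings is what triggers the DtypeWarning on the real file.
csv = io.StringIO("Year,Col22\n2003,7\n2003,NA\n2003,x\n")

# Pinning the dtype up front (or passing low_memory=False) avoids the
# chunk-by-chunk type inference that produces the warning.
df = pd.read_csv(csv, dtype={"Col22": str})
print(df["Col22"].dtype)   # object
```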

Restoring the original environment

$ source ./config/unload_ipy35_env.sh

This restores the PATH variable to its original state.

Credits

Thanks to Duncan Temple Lang, who first showed me the dataset. Please also see his book XML and Web Technologies for Data Science with R for another use of a dataset from the same source.