HeatWave is an integrated, massively parallel, high-performance, in-memory query accelerator for MySQL Database Service that accelerates performance of MySQL by orders of magnitude for analytics and mixed workloads. It is the only service that enables you to run OLTP and OLAP workloads simultaneously and directly from your MySQL database, without any changes to your applications. This eliminates the need for complex, time-consuming, and expensive data movement and integration with a separate analytics database. Your applications connect to the HeatWave cluster through standard MySQL protocols.
MySQL HeatWave users currently have no easy way to create machine learning models for the data in their database, or to generate predictions and explanations from those models. Such users, while database experts, are frequently new to machine learning and can benefit from products that streamline the creation and use of machine learning models. HeatWave ML is the product that addresses this need.
This set of benchmarks is based on widely used machine learning datasets drawn from multiple sources.
Benchmark | Explanation | #Rows (Training Set) | #Features |
---|---|---|---|
airlines | Predict Flight Delays | 377568 | 8 |
bank_marketing | Direct marketing – Banking Products | 31648 | 17 |
cnae-9 | Documents with free text business descriptions of Brazilian companies | 757 | 857 |
connect-4 | 8-ply positions in the game of connect-4 in which neither player has won yet – predict win/loss | 47290 | 161 |
fashion_mnist | Clothing classification problem | 60000 | 785 |
nomao | Active learning is used to efficiently detect data that refer to a same place based on Nomao browser | 24126 | 119 |
numerai | Data is cleaned, regularized and encrypted global equity data | 67425 | 22 |
higgs | Monte Carlo Simulations | 10500000 | 29 |
census | Determine if a person makes > 50k | 32561 | 15 |
titanic | Survival Status of individuals | 917 | 14 |
creditcard | Identify fraudulent transactions | 199364 | 30 |
appetency | Predict the propensity of customers to buy new products | 35000 | 230 |
black_friday | Customer purchases on Black Friday | 116774 | 10 |
diamonds | Predict price of a diamond | 37758 | 10 |
mercedes | Time the car took to pass testing | 2946 | 377 |
news_popularity | Predict the number of shares of article in social networks (popularity) | 27750 | 60 |
nyc_taxi | Predict tip amount for NYC taxi cab | 407284 | 15 |
| | The popularity of a topic on social media | 408275 | 78 |
- Provision MySQL Database Service instance and add a 2-node HeatWave cluster.
- Clone this repository and change into its directory
git clone https://github.com/oracle-samples/heatwave-ml.git
cd heatwave-ml
- Create a Python virtual environment and activate it as follows
python3.8 -m venv py_heatwaveml
source py_heatwaveml/bin/activate
- Install the necessary Python packages
pip install pandas==1.4.2 numpy==1.22.3 unlzw3==0.2.1 scikit-learn==1.0.2 pyreadr
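Before running the benchmarks, it can help to confirm that the active environment actually has the pinned versions. The following is a minimal sketch using the standard library's `importlib.metadata`; the `PINNED` mapping and the `check_pins` helper are illustrative names, not part of this repository, and `pyreadr` is omitted because the install command does not pin it.

```python
from importlib import metadata

# Versions copied from the pip command above (pyreadr is unpinned there).
PINNED = {
    "pandas": "1.4.2",
    "numpy": "1.22.3",
    "unlzw3": "0.2.1",
    "scikit-learn": "1.0.2",
}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (wanted, found)} for every package that does not match.

    `found` is None when the package is not installed at all.
    """
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


if __name__ == "__main__":
    for name, (wanted, found) in check_pins(PINNED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

Running the script inside the activated virtual environment prints nothing when every pinned package matches.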
Click on the links below to download the respective benchmarks. You can also use wget from the command line.
airlines
bank_marketing
cnae-9
connect-4
fashion_mnist
- https://github.com/zalandoresearch/fashion-mnist/blob/master/data/fashion/t10k-images-idx3-ubyte.gz
- https://github.com/zalandoresearch/fashion-mnist/blob/master/data/fashion/t10k-labels-idx1-ubyte.gz
- https://github.com/zalandoresearch/fashion-mnist/blob/master/data/fashion/train-images-idx3-ubyte.gz
- https://github.com/zalandoresearch/fashion-mnist/blob/master/data/fashion/train-labels-idx1-ubyte.gz
nomao
numerai
higgs
census
- https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
- https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test
titanic
creditcard
appetency
nyc_taxi
news_popularity
black_friday
mercedes
diamonds
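The fashion_mnist links above point at GitHub `blob` pages, which serve an HTML file viewer rather than the file itself, so fetching them with wget or a script typically saves a web page instead of the archive. The sketch below rewrites such a URL to its `raw.githubusercontent.com` equivalent and downloads it with the standard library. It assumes the usual `<owner>/<repo>/blob/<ref>/<path>` layout with a branch name that contains no slashes; `github_blob_to_raw` and `download` are hypothetical helper names, not part of this repository.

```python
import os
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlretrieve


def github_blob_to_raw(url: str) -> str:
    """Map a github.com blob URL to its raw-content URL; return other URLs unchanged."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected shape: <owner>/<repo>/blob/<ref>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return url
    owner, repo, _, ref, *path = segments
    raw_path = "/" + "/".join([owner, repo, ref, *path])
    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, "", ""))


def download(url: str, dest_dir: str = ".") -> str:
    """Download `url` into `dest_dir`, keeping the file name from the URL."""
    raw_url = github_blob_to_raw(url)
    dest = os.path.join(dest_dir, os.path.basename(urlsplit(raw_url).path))
    urlretrieve(raw_url, dest)
    return dest
```

URLs that are not GitHub blob pages, such as the UCI links listed for census, pass through `github_blob_to_raw` unchanged.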
After you have downloaded a benchmark, run the preprocess.py script with the benchmark name, as shown below:
python3 heatwave-ml/preprocess.py --benchmark <name>
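When several benchmarks have been downloaded, the same command can be repeated for each of them. The sketch below wraps the invocation shown above in a small loop; the function names are illustrative, it uses the current interpreter (`sys.executable`) in place of `python3`, and it assumes the script path is the same one used in the command above.

```python
import subprocess
import sys


def preprocess_command(benchmark, script="heatwave-ml/preprocess.py"):
    """Build the argument list for one preprocess.py invocation."""
    return [sys.executable, script, "--benchmark", benchmark]


def preprocess_all(benchmarks, run=subprocess.run):
    """Run preprocess.py once per benchmark, stopping at the first failure."""
    for name in benchmarks:
        # check=True raises CalledProcessError if the script exits non-zero.
        run(preprocess_command(name), check=True)
```

For example, `preprocess_all(["airlines", "census"])` would preprocess those two benchmarks in order, provided their data files have already been downloaded.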
Launch MySQL Shell as shown below:
mysqlsh user@hostname --mysql --sql
At the MySQL Shell prompt, run:
> source heatwave-ml/<benchmark_name>.sql
where <benchmark_name> is a name from the table above. The train and test CSV files generated by the preprocessing step should be present in MySQL Shell's current directory. Each SQL file creates the schema for a benchmark, trains a HeatWave ML model on it, and scores the model on the test data. The test score is output at the end.
To measure the scalability of HeatWave ML on the benchmarks above, run the ML_TRAIN commands from each benchmark's SQL file on 1, 2, 4, 8, and 16 nodes. Measure the end-to-end training time (ML_TRAIN time from the MySQL client's perspective) for each configuration (benchmark + number of nodes). Plotting the training time against the number of nodes shows how well a benchmark scales.
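One possible way to summarize such measurements is to convert the recorded training times into speedup and parallel efficiency relative to the smallest cluster. This is a sketch under that assumption, not a procedure prescribed by this repository; `time_call` and `scalability` are hypothetical helpers, and the timings must come from your own ML_TRAIN runs.

```python
import time


def time_call(fn):
    """Return the wall-clock seconds taken by calling fn(), as seen by the client."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def scalability(timings):
    """Summarize training times for one benchmark.

    `timings` maps node count -> measured end-to-end training time in seconds.
    Returns a list of (nodes, seconds, speedup, efficiency) tuples sorted by
    node count, where speedup is relative to the smallest node count and
    efficiency is speedup divided by the increase in node count.
    """
    if not timings:
        raise ValueError("timings must contain at least one measurement")
    base_nodes = min(timings)
    base_time = timings[base_nodes]
    rows = []
    for nodes in sorted(timings):
        speedup = base_time / timings[nodes]
        efficiency = speedup / (nodes / base_nodes)
        rows.append((nodes, timings[nodes], speedup, efficiency))
    return rows
```

An efficiency close to 1.0 indicates near-linear scaling for that benchmark; values well below 1.0 indicate that adding nodes yields diminishing returns.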
This project welcomes contributions from the community. Before submitting a pull request, please review our contribution guide.
Please consult the security guide for our responsible security vulnerability disclosure process.
Copyright (c) 2023 Oracle and/or its affiliates.
Released under the Universal Permissive License v1.0 as shown at https://oss.oracle.com/licenses/upl/.