Uchit @ Spark+AI Summit Europe 2019 - Auto-Pilot for Apache Spark Using Machine Learning
Uchit helps auto-tune the configuration of a Spark application using machine learning and other abstract mathematical models. Tuning Spark configuration is largely a manual effort, needs Big Data domain-specific expertise, and does not scale. Uchit automates this process.
Uchit takes data from previous runs of the same application (config values and the application's run time) as input. An ML model trained on this data then predicts the best-performing config among the sampled configs.
1. Data from previous runs of the same application is normalized and used as training data.
2. An ML model is trained on the data gathered in step 1.
3. Using Latin Hypercube Sampling (LHS), representative samples are picked from the combined domain of the configs so that the entire sample space is covered.
4. A mathematical model, together with any other model, prunes the non-optimal samples using domain-specific knowledge.
5. From the samples remaining after step 4, the ML model predicts the best-performing sample.
6. The predicted best-performing config is denormalized and returned to the user.
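The sampling, pruning, and denormalization steps above can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not Uchit's actual implementation: the search ranges in `BOUNDS`, the helper names, and the pruning rule in `fits_on_node` are all assumptions made for this example, and the ML prediction step is left out.

```python
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in [0, 1)^n_dims, one per stratum in every dimension."""
    columns = []
    for _ in range(n_dims):
        # Draw one value inside each of the n_samples equal-width strata.
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(column)  # break the alignment of strata across dimensions
        columns.append(column)
    # Transpose the per-dimension columns into sample points.
    return [tuple(column[i] for column in columns) for i in range(n_samples)]


def denormalize(point, bounds):
    """Map a unit-cube point back to integer config values via (low, high) bounds."""
    return {
        name: int(round(low + x * (high - low)))
        for x, (name, (low, high)) in zip(point, bounds)
    }


def fits_on_node(config, node_cores, node_memory_mb):
    """Illustrative pruning rule (an assumption): executors packed on one node must fit in its memory."""
    executors_per_node = node_cores // config["spark.executor.cores"]
    return executors_per_node * config["spark.executor.memory"] <= node_memory_mb


# Hypothetical search ranges, chosen only for this example.
BOUNDS = [
    ("spark.executor.cores", (1, 4)),
    ("spark.executor.memory", (1024, 23889)),
    ("spark.sql.shuffle.partitions", (50, 1000)),
]

unit_samples = latin_hypercube(8, len(BOUNDS), random.Random(0))
candidates = [denormalize(point, BOUNDS) for point in unit_samples]
kept = [c for c in candidates if fits_on_node(c, node_cores=4, node_memory_mb=26544)]
```

Because every dimension is split into as many strata as there are samples, and each stratum is used exactly once, even a small number of samples spreads across the whole range of every config.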
from spark.combiner.combiner import Combiner
combiner = Combiner(4, 26544)
- 4 is the number of cores per worker node.
- 26544 is the total memory per worker node in MB.
training_data_1 = {
"spark.executor.memory": 11945,
"spark.sql.shuffle.partitions": 200,
"spark.executor.cores": 2,
"spark.driver.memory": 1024 * 2,
"spark.sql.autoBroadcastJoinThreshold": 10,
"spark.sql.statistics.fallBackToHdfs": 0
}
runtime_in_sec = 248
combiner.add_training_data(training_data_1, runtime_in_sec)
best_config = combiner.get_best_config()
print(best_config)
{'spark.executor.cores': 4, 'spark.driver.memory': 2048, 'spark.sql.statistics.fallBackToHdfs': 1, 'spark.sql.autoBroadcastJoinThreshold': 100, 'spark.executor.memory': 23889, 'spark.sql.shuffle.partitions': 200}
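One possible way to hand such a dictionary to a job is to turn it into `--conf key=value` arguments for `spark-submit`. The helper below is a hypothetical sketch, not part of Uchit. It assumes that the memory values are in MB (as the `Combiner` argument is) and that 0/1 encodes false/true. The unit of `spark.sql.autoBroadcastJoinThreshold` in Uchit's output is not stated here, while Spark itself expects bytes, so that value is passed through unchanged and should be checked before use.

```python
def to_spark_submit_args(
    config,
    memory_keys=("spark.executor.memory", "spark.driver.memory"),
    boolean_keys=("spark.sql.statistics.fallBackToHdfs",),
):
    """Format a config dict as a list of spark-submit '--conf key=value' arguments."""
    args = []
    for key in sorted(config):
        value = config[key]
        if key in memory_keys:
            value = "%dm" % value  # assumption: the value is in MB
        elif key in boolean_keys:
            value = "true" if value else "false"  # assumption: 0/1 means false/true
        # Any other value is passed through as-is; verify its unit before relying on it.
        args.extend(["--conf", "%s=%s" % (key, value)])
    return args


best_config = {
    "spark.executor.cores": 4,
    "spark.driver.memory": 2048,
    "spark.sql.statistics.fallBackToHdfs": 1,
    "spark.sql.autoBroadcastJoinThreshold": 100,
    "spark.executor.memory": 23889,
    "spark.sql.shuffle.partitions": 200,
}
submit_args = to_spark_submit_args(best_config)
```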
- Compute the best config for next run.
- Run the job with the suggested configuration.
- Run with the configs suggested in Step 3 and, based on this run, add new training data to the model.
- Get the new best config to run the job again.
- Repeat this process for a predefined number of times, n.
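The loop described above might look like the following sketch. It relies only on the two `Combiner` methods shown earlier, `add_training_data` and `get_best_config`. The `run_job` callback is hypothetical: it stands for whatever code submits the Spark application with a given config and returns its run time in seconds.

```python
def tune(combiner, run_job, initial_config, n_iterations):
    """Alternate between running the job and asking the combiner for a new config.

    Returns the (config, runtime_in_sec) pair with the lowest observed run time.
    """
    config = dict(initial_config)
    history = []
    for _ in range(n_iterations):
        runtime_in_sec = run_job(config)  # hypothetical: submit the job, time it
        combiner.add_training_data(config, runtime_in_sec)
        history.append((config, runtime_in_sec))
        config = combiner.get_best_config()  # suggestion for the next run
    return min(history, key=lambda item: item[1])
```

A call could then read `tune(Combiner(4, 26544), run_job, training_data_1, n)`. Returning the best observed run, rather than the last suggestion, guards against a final prediction that turns out slower than an earlier one.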
Please see Uchit Tutorial for more details.
If you have any feedback, feel free to shoot an email to amoghm@qubole.com or mayurb@qubole.com