Scalable ML Model Validation
In this project, I build an automated pipeline that helps data scientists and engineers stress-test mission-critical machine learning models over hundreds of millions of images, samples, and scenarios. Specifically, I use a TensorFlow-based traffic light detector as an example.
The pipeline consists of the following three stages in tandem:
- Data storage: I use Amazon Web Services (AWS) S3
- Distributed computing/model validation: I use an Apache Spark cluster that consists of one master and three workers
- Database: I use PostgreSQL to build a relational database
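The three stages above can be sketched end to end in plain Python. The helper names below are illustrative stand-ins (not the actual API of `s3_util.py` or `main.py`): listing image keys from storage, then splitting them into batches, one per Spark partition, so the workers share the load roughly evenly.

```python
# Hypothetical sketch of the storage -> distributed-validation hand-off.
# list_image_keys stands in for an s3_util listing call; chunk_keys shows
# how keys could be grouped into one batch per Spark partition.

def list_image_keys(bucket, n):
    """Stand-in for an S3 listing call: return n sample image keys."""
    return ["%s/img_%06d.jpg" % (bucket, i) for i in range(n)]

def chunk_keys(keys, num_partitions):
    """Split keys into roughly equal batches, one per Spark partition."""
    size = -(-len(keys) // num_partitions)  # ceiling division
    return [keys[i:i + size] for i in range(0, len(keys), size)]

if __name__ == "__main__":
    keys = list_image_keys("traffic-light-images", 10)
    batches = chunk_keys(keys, 3)
    print([len(b) for b in batches])  # batch sizes across partitions
```

In the real pipeline, each batch would be handed to a Spark worker, which downloads the images from S3 and runs the detector on them.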
Dependencies
- Java 8 + OpenJDK
- Zookeeper 3.4.9
- Kafka 0.10.1.1
- Hadoop 2.7.4
- Spark 2.1.1
- pyspark 2.1.1+hadoop2.7
- OpenCV 3.1.0
- TensorFlow 1.2.0 and 1.8.0
- boto 2.48.0
- PostgreSQL 9.5.13
- psycopg2 2.7.5
Installation
git clone https://github.com/kcg2015/Insight_DE_Project.git
Running the scripts
Create a database
sudo -u postgres psql
sudo -u postgres createdb -O data_engineer test_result_db
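Once `test_result_db` exists, the validation results can be written into a table such as the one below. The schema and column names here are assumptions for illustration, not taken from the repo; sqlite3 stands in for PostgreSQL so the snippet runs without a server, but the SQL is the same shape psycopg2 would execute against `test_result_db`.

```python
import sqlite3

# sqlite3 stands in for a psycopg2 connection so this sketch is runnable;
# with PostgreSQL you would instead use psycopg2.connect(dbname="test_result_db", ...)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Assumed result schema: one row per validated image (illustrative columns).
cur.execute("""
    CREATE TABLE detection_results (
        image_key     TEXT PRIMARY KEY,  -- S3 key of the image
        num_detected  INTEGER,           -- traffic lights found
        max_score     REAL,              -- top confidence score
        latency_ms    REAL               -- per-image inference time
    )
""")
cur.execute(
    "INSERT INTO detection_results VALUES (?, ?, ?, ?)",
    ("images/img_000001.jpg", 2, 0.97, 41.5),
)
conn.commit()
print(cur.execute("SELECT COUNT(*) FROM detection_results").fetchone()[0])
```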
Spark batch processing
bin/spark-submit --master spark://ip-10-0-0-9:7077 \
--conf "spark.executor.memory=5g" \
--py-files /home/ubuntu/Insight_DE_project/tl_detector.py,/home/ubuntu/Insight_DE_project/s3_util.py,/home/ubuntu/Insight_DE_project/db_util.py \
/home/ubuntu/Insight_DE_project/main.py
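Conceptually, the batch job distributes image keys across the executors, runs the detector on each partition, and collects per-image results for the database. Below is a local, Spark-free sketch of that map step with a stub detector; the real `tl_detector.py` loads the image from S3 and runs a TensorFlow graph, and `main.py` would pass a function like `process_partition` to `rdd.mapPartitions(...)`.

```python
def detect_traffic_lights(image_key):
    """Stub for the TensorFlow detector in tl_detector.py: returns
    (key, num_detections). The real version downloads the image from
    S3, runs inference, and reports detection scores."""
    return (image_key, 1)

def process_partition(keys):
    """The kind of body that would be passed to rdd.mapPartitions(...)."""
    return [detect_traffic_lights(k) for k in keys]

# Locally simulate what sc.parallelize(keys, 2).mapPartitions(process_partition)
# would compute across two executors.
partitions = [["img_0.jpg", "img_1.jpg"], ["img_2.jpg"]]
results = [r for part in partitions for r in process_partition(part)]
print(results)
```

Each `(key, count)` pair would then be inserted into `test_result_db` via `db_util.py`.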