mllite/ml2cpp

ml2cpp step 2 : C++ code design


Need to have a complete specification for the following :

  1. Test datasets (CSV file => C++ std::map)
  2. Classification/Regression/Transformation models : C++ functions used to compute the scores
  3. Classification/Regression/Transformation models : input/output datasets layouts

This spec should evolve as more models and features are added.

The automatically generated code is plain STL C++17. It is designed to keep a strong semantic mapping with the model, which allows auditing, debugging and reporting.

The C++ code contains everything needed to compute the predicted values of the model. No external library is needed, and the code can be compiled for any target hardware platform using any standard C++ compiler that supports C++17.

Typical generated code for a classification model :

source : https://github.com/antoinecarme/ml2cpp/blob/master/doc/LinearModels/ml2cpp_ridge_classifier_iris.ipynb

namespace  {

	std::vector<std::string> get_input_names(){
		std::vector<std::string> lFeatures = { "Feature_0", "Feature_1", "Feature_2", "Feature_3" };

		return lFeatures;
	}

	std::vector<std::any> get_classes(){
		std::vector<std::any> lClasses = { 0, 1, 2 };

		return lClasses;
	}

	std::vector<std::string> get_output_names(){
		std::vector<std::string> lOutputs = { 
			"Score_0", "Score_1", "Score_2",
			"Proba_0", "Proba_1", "Proba_2",
			"LogProba_0", "LogProba_1", "LogProba_2",
			"Decision", "DecisionProba" };

		return lOutputs;
	}

	tTable compute_classification_scores(std::any Feature_0, std::any Feature_1, std::any Feature_2, std::any Feature_3) {
		auto lClasses = get_classes();

		std::any score_0 = 0.12726862685332171 * Feature_0 + 0.47083648636124975 * Feature_1 + -0.445366165446255 * Feature_2 + -0.1212031740524525 * Feature_3 + -0.6974613707207697;

		std::any score_1 = -0.027607436590965116 * Feature_0 + -0.8779879015502388 * Feature_1 + 0.3719963526001637 * Feature_2 + -0.8328757056773156 * Feature_3 + 2.1132211020903813;

		std::any score_2 = -0.09966119026235667 * Feature_0 + 0.40715141518899145 * Feature_1 + 0.0733698128460983 * Feature_2 + 0.9540788797297515 * Feature_3 + -2.4157597313696253;


		tTable lTable;

		lTable["Score"] = { 
			score_0,
			score_1,
			score_2 
		} ;
		lTable["Proba"] = { 
			std::any(),
			std::any(),
			std::any() 
		} ;
		int lBestClass = get_arg_max( lTable["Score"] );
		auto lDecision = lClasses[lBestClass];
		lTable["Decision"] = { lDecision } ;
		lTable["DecisionProba"] = { lTable["Proba"][lBestClass] };

		recompute_log_probas( lTable );

		return lTable;
	}

	tTable compute_model_outputs_from_table( tTable const & iTable) {
		tTable lTable = compute_classification_scores(iTable.at("Feature_0")[0], iTable.at("Feature_1")[0], iTable.at("Feature_2")[0], iTable.at("Feature_3")[0]);

		return lTable;
	}

} // eof namespace 
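
The classification code above calls get_arg_max and recompute_log_probas without showing them. The following is only a sketch of what such helpers could look like, not the actual ml2cpp support code. It assumes that every numeric cell holds a double and that an empty std::any means "no probability available" (as in the "Proba" column above).

```cpp
#include <any>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef std::vector<std::any> tAnyVector;
typedef std::map<std::string, tAnyVector> tTable;

// Hypothetical helper: index of the largest value (assumes double cells).
inline int get_arg_max(const tAnyVector & iValues) {
	assert(!iValues.empty());
	std::size_t lBest = 0;
	for (std::size_t i = 1; i < iValues.size(); ++i) {
		if (std::any_cast<double>(iValues[i]) > std::any_cast<double>(iValues[lBest])) {
			lBest = i;
		}
	}
	return static_cast<int>(lBest);
}

// Hypothetical helper: fill "LogProba" from "Proba".
// A cell without a value stays empty instead of being passed to std::log.
inline void recompute_log_probas(tTable & ioTable) {
	tAnyVector lLogProbas;
	for (const std::any & lProba : ioTable["Proba"]) {
		if (lProba.has_value()) {
			lLogProbas.push_back(std::log(std::any_cast<double>(lProba)));
		} else {
			lLogProbas.push_back(std::any());
		}
	}
	ioTable["LogProba"] = lLogProbas;
}
```

With these definitions, a model without probabilities (such as the ridge classifier) still produces a "LogProba" column, filled with empty cells.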

std::any is used for all types of data: scores, probabilities, etc. It is more generic and concise than std::variant. It requires C++17.
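
Standard C++ defines no arithmetic on std::any, so expressions such as `0.127 * Feature_0 + ...` in the generated code need operator overloads from somewhere. The sketch below shows one way this could work. The namespace any_math_sketch, the helper as_double and the two operators are hypothetical names, and the sketch assumes every numeric value is stored as a double.

```cpp
#include <any>
#include <cassert>
#include <typeinfo>

namespace any_math_sketch {

	// Assumption: numeric cells always hold a double.
	inline double as_double(const std::any & iValue) {
		assert(iValue.type() == typeid(double));
		return std::any_cast<double>(iValue);
	}

	// A double operand is implicitly converted to std::any, so these two
	// overloads also cover "double * any" and "any + double".
	inline std::any operator+(const std::any & iLeft, const std::any & iRight) {
		return as_double(iLeft) + as_double(iRight);
	}

	inline std::any operator*(const std::any & iLeft, const std::any & iRight) {
		return as_double(iLeft) * as_double(iRight);
	}

	// Same shape as a generated score line: coefficients, features, intercept.
	inline std::any linear_score(std::any Feature_0, std::any Feature_1) {
		std::any lScore = 0.5 * Feature_0 + -2.0 * Feature_1 + 1.0;
		return lScore;
	}

} // eof namespace any_math_sketch
```

Subtraction, division and comparison would follow the same pattern. The price of this genericity is that type errors only show up at run time (here through the assert, or through std::bad_any_cast).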

A test dataset is a standard C++ map (tTable) that assigns a vector of std::any to each column name. Class scores are stored in one vector, class probabilities in another, features are stored separately, etc.

typedef std::vector<std::any> tAnyVector;
typedef std::map<std::string, tAnyVector> tTable;
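
As an illustration of these two typedefs, here is a minimal sketch that builds a one-row table and reads a cell back. The functions make_example_row and get_double and the "Label" column are hypothetical and only serve the example.

```cpp
#include <any>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef std::vector<std::any> tAnyVector;
typedef std::map<std::string, tAnyVector> tTable;

// One row: each column name maps to a vector with a single cell.
inline tTable make_example_row() {
	tTable lTable;
	lTable["Feature_0"] = { 5.1 };
	lTable["Feature_1"] = { 3.5 };
	lTable["Label"] = { std::string("setosa") };  // cells may hold any type
	return lTable;
}

// Reading a cell requires knowing its stored type;
// std::any_cast throws std::bad_any_cast on a mismatch.
inline double get_double(const tTable & iTable, const std::string & iColumn, std::size_t iRow) {
	return std::any_cast<double>(iTable.at(iColumn).at(iRow));
}
```

Calling get_double on the "Label" column would throw, which is the run-time counterpart of the flexibility that std::any provides.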

An input dataset is a particular feature dataset (tTable).

A model output is also a particular dataset (tTable). Models can be chained by taking the output of the previous model as input.
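
Chaining can be sketched as follows. The two namespaces, their coefficients and the function compute_pipeline are made up for the example; only the function name compute_model_outputs_from_table and the tTable layout come from the generated code above.

```cpp
#include <any>
#include <cassert>
#include <map>
#include <string>
#include <vector>

typedef std::vector<std::any> tAnyVector;
typedef std::map<std::string, tAnyVector> tTable;

namespace scaler_sketch {
	// Hypothetical transformation: center and scale one feature.
	inline tTable compute_model_outputs_from_table(const tTable & iTable) {
		double lValue = std::any_cast<double>(iTable.at("Feature_0")[0]);
		tTable lTable;
		lTable["Feature_0"] = { (lValue - 2.0) / 4.0 };
		return lTable;
	}
}

namespace regressor_sketch {
	// Hypothetical linear regression on the scaled feature.
	inline tTable compute_model_outputs_from_table(const tTable & iTable) {
		double lValue = std::any_cast<double>(iTable.at("Feature_0")[0]);
		tTable lTable;
		lTable["Estimator"] = { 3.0 * lValue + 1.0 };
		return lTable;
	}
}

// The output table of the first model is the input table of the second.
inline tTable compute_pipeline(const tTable & iTable) {
	return regressor_sketch::compute_model_outputs_from_table(
		scaler_sketch::compute_model_outputs_from_table(iTable));
}
```

This works because the output column names of the first model match the input column names expected by the second one.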

There is some kind of algebra on tTables. 'softmax' is a special operation that takes a tTable with scores and produces a tTable of probabilities. An average of tTables is a tTable (random forest tTable = mean(tTable output of trees)), etc. This algebra is to be extended as more and more complex models are added.
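
Two operations of this algebra could look like the sketch below. The function names softmax and mean_tables, the choice of the "Score" and "Proba" columns and the assumption that all cells hold doubles are illustrative, not the actual ml2cpp implementation.

```cpp
#include <algorithm>
#include <any>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef std::vector<std::any> tAnyVector;
typedef std::map<std::string, tAnyVector> tTable;

// Sketch of 'softmax': reads the "Score" column, returns the table with a "Proba" column.
inline tTable softmax(const tTable & iTable) {
	const tAnyVector & lScores = iTable.at("Score");
	assert(!lScores.empty());
	// Subtract the largest score before exponentiating, for numerical stability.
	double lMax = std::any_cast<double>(lScores[0]);
	for (const std::any & lScore : lScores) {
		lMax = std::max(lMax, std::any_cast<double>(lScore));
	}
	std::vector<double> lExps;
	double lSum = 0.0;
	for (const std::any & lScore : lScores) {
		lExps.push_back(std::exp(std::any_cast<double>(lScore) - lMax));
		lSum += lExps.back();
	}
	tAnyVector lProbas;
	for (double lExp : lExps) {
		lProbas.push_back(lExp / lSum);
	}
	tTable lTable = iTable;
	lTable["Proba"] = lProbas;
	return lTable;
}

// Sketch of a cell-wise mean of tables that share the same columns and sizes.
inline tTable mean_tables(const std::vector<tTable> & iTables) {
	assert(!iTables.empty());
	tTable lMean;
	for (const auto & [lName, lColumn] : iTables[0]) {
		tAnyVector lCells;
		for (std::size_t i = 0; i < lColumn.size(); ++i) {
			double lSum = 0.0;
			for (const tTable & lTable : iTables) {
				lSum += std::any_cast<double>(lTable.at(lName).at(i));
			}
			lCells.push_back(lSum / iTables.size());
		}
		lMean[lName] = lCells;
	}
	return lMean;
}
```

Under these assumptions, a random forest output would be mean_tables applied to the output tables of its trees.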

tTables can be read from and written to CSV files or database tables.
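
A minimal CSV reader could be sketched as below. The functions split_csv_line and read_csv are hypothetical. The sketch assumes a header line, comma separators, purely numeric cells (stored as doubles) and no quoting or escaping.

```cpp
#include <any>
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::any> tAnyVector;
typedef std::map<std::string, tAnyVector> tTable;

// Split one line on commas (no quoting or escaping support in this sketch).
inline std::vector<std::string> split_csv_line(const std::string & iLine) {
	std::vector<std::string> lFields;
	std::stringstream lStream(iLine);
	std::string lField;
	while (std::getline(lStream, lField, ',')) {
		lFields.push_back(lField);
	}
	return lFields;
}

// Read a header line followed by numeric rows; every cell is stored as a double.
inline tTable read_csv(std::istream & iStream) {
	tTable lTable;
	std::string lLine;
	if (!std::getline(iStream, lLine)) {
		return lTable;  // empty input gives an empty table
	}
	const std::vector<std::string> lNames = split_csv_line(lLine);
	for (const std::string & lName : lNames) {
		lTable[lName];  // create the (still empty) column
	}
	while (std::getline(iStream, lLine)) {
		if (lLine.empty()) {
			continue;
		}
		const std::vector<std::string> lFields = split_csv_line(lLine);
		assert(lFields.size() == lNames.size());
		for (std::size_t i = 0; i < lNames.size(); ++i) {
			lTable[lNames[i]].push_back(std::stod(lFields[i]));
		}
	}
	return lTable;
}
```

Since read_csv takes a std::istream, the same function can be fed from a std::ifstream for a file or from a std::istringstream in a test.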

For readability: each model is a specific C++ namespace. Sub-models (in meta-models and ensembles) and layers in NNs are also namespaces. This also allows tens of separately generated models to be used in the same C++ program.
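
The namespace layout can be sketched as follows. All names and numbers are invented for the example, and plain doubles replace tTables to keep it short.

```cpp
#include <cassert>

// Two independently generated models can expose the same function name
// without clashing, because each one lives in its own namespace.
namespace model_ridge_sketch {
	inline double compute_score(double iFeature) {
		return 2.0 * iFeature + 1.0;
	}
}

namespace model_forest_sketch {

	// Sub-models of an ensemble are nested namespaces.
	namespace Tree_0 {
		inline double compute_score(double iFeature) {
			return (iFeature <= 1.0) ? 0.0 : 1.0;
		}
	}

	namespace Tree_1 {
		inline double compute_score(double iFeature) {
			return (iFeature <= 3.0) ? 0.0 : 1.0;
		}
	}

	// The ensemble output is the mean of its sub-model outputs.
	inline double compute_score(double iFeature) {
		return (Tree_0::compute_score(iFeature) + Tree_1::compute_score(iFeature)) / 2.0;
	}
}
```

A caller selects a model by qualification, for example model_forest_sketch::compute_score(2.0), and can inspect a single sub-model the same way with model_forest_sketch::Tree_0::compute_score(2.0).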

For readability: give the main algorithm steps meaningful, human-friendly names (map the code vocabulary and semantics to the model). The user should be able to validate, inspect and debug the model by looking at the C++ code.

TODO: check whether the various compilers limit the number of namespaces. A common random forest with 500 trees will generate C++ code with at least 500 namespaces. SQL allows this, so why not C++?

https://github.com/antoinecarme/sklearn2sql_heroku/blob/master/docs/WebService-RandomForest_512_Deploy.ipynb

TODO : check using classes instead of namespaces. A class IS a namespace.

The compiled code should not rely on any external library. C++ is enough to compute any machine learning model "by hand".

antoine@z600:/tmp$ ldd sklearn2sql_cpp_iris_RidgeClassifier_140045544887056.exe
        linux-vdso.so.1 (0x00007ffe7413a000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f8135758000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f8135614000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f81355fa000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f8135435000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f8135982000)

Typical generated code for a regression model (even simpler) :

source : https://github.com/antoinecarme/ml2cpp/blob/master/doc/LinearModels/ml2cpp_ridge_regressor-boston.ipynb

namespace  {

	std::vector<std::string> get_input_names(){
		std::vector<std::string> lFeatures = { "Feature_0", "Feature_1", "Feature_2", "Feature_3", "Feature_4", "Feature_5", "Feature_6", "Feature_7", "Feature_8", "Feature_9", "Feature_10", "Feature_11", "Feature_12" };

		return lFeatures;
	}

	std::vector<std::string> get_output_names(){
		std::vector<std::string> lOutputs = { "Estimator" };

		return lOutputs;
	}

	tTable compute_regression(std::any Feature_0, std::any Feature_1, std::any Feature_2, std::any Feature_3, std::any Feature_4, std::any Feature_5, std::any Feature_6, std::any Feature_7, std::any Feature_8, std::any Feature_9, std::any Feature_10, std::any Feature_11, std::any Feature_12) {

		tTable lTable;

		std::any  lEstimator = -0.10222110133730666 * Feature_0 + 0.04773129624686468 * Feature_1 + -6.436208742578908e-05 * Feature_2 + 2.627820041255508 * Feature_3 + -11.121694375850694 * Feature_4 + 3.8789420030475736 * Feature_5 + -0.005439894300973365 * Feature_6 + -1.3800822175215268 * Feature_7 + 0.29004395043741604 * Feature_8 + -0.013003140540395218 * Feature_9 + -0.8831486448890916 * Feature_10 + 0.009736544133046856 * Feature_11 + -0.5359293002502585 * Feature_12 + 31.308451803397112;
		lTable[ "Estimator" ] = { lEstimator };

		return lTable;
	}

	tTable compute_model_outputs_from_table( tTable const & iTable) {
		tTable lTable = compute_regression(iTable.at("Feature_0")[0], iTable.at("Feature_1")[0], iTable.at("Feature_2")[0], iTable.at("Feature_3")[0], iTable.at("Feature_4")[0], iTable.at("Feature_5")[0], iTable.at("Feature_6")[0], iTable.at("Feature_7")[0], iTable.at("Feature_8")[0], iTable.at("Feature_9")[0], iTable.at("Feature_10")[0], iTable.at("Feature_11")[0], iTable.at("Feature_12")[0]);

		return lTable;
	}

} // eof namespace 

Typical generated code for a feature transformation :

source : https://github.com/antoinecarme/ml2cpp/blob/master/doc/Transformations/ml2cpp_transform_std_scaler_iris.ipynb

namespace  {

	std::vector<std::string> get_input_names(){
		std::vector<std::string> lFeatures = { "Feature_0", "Feature_1", "Feature_2", "Feature_3" };

		return lFeatures;
	}

	std::vector<std::string> get_output_names(){
		std::vector<std::string> lOutputs = { "Feature_0", "Feature_1", "Feature_2", "Feature_3" };

		return lOutputs;
	}

	tTable compute_features(std::any Feature_0, std::any Feature_1, std::any Feature_2, std::any Feature_3) {

		tTable lTable;

		lTable["Feature_0"] = { ( ( Feature_0 - 5.843333333333334 ) / 0.8253012917851409 ) };
		lTable["Feature_1"] = { ( ( Feature_1 - 3.0573333333333337 ) / 0.4344109677354946 ) };
		lTable["Feature_2"] = { ( ( Feature_2 - 3.7580000000000005 ) / 1.759404065775303 ) };
		lTable["Feature_3"] = { ( ( Feature_3 - 1.1993333333333336 ) / 0.7596926279021594 ) };

		return lTable;
	}

	tTable compute_model_outputs_from_table( tTable const & iTable) {
		tTable lTable = compute_features(iTable.at("Feature_0")[0], iTable.at("Feature_1")[0], iTable.at("Feature_2")[0], iTable.at("Feature_3")[0]);

		return lTable;
	}

} // eof namespace 

Closing

Typical generated code for an outlier detection (sklearn.covariance._elliptic_envelope.EllipticEnvelope) :

namespace  {

        std::vector<std::string> get_input_names(){
                std::vector<std::string> lFeatures = { "A", "B" };

                return lFeatures;
        }

        std::vector<std::string> get_output_names(){
                std::vector<std::string> lOutputs = { 
                        "AnomalyScore","OutlierIndicator" };

                return lOutputs;
        }
        tTable compute_outlier_scores(std::any A, std::any B) {
                std::any A_c = A - 0.0;

                std::any B_c = B - 0.0;

                std::any lMahalanobis = 4.000000000000003 * A_c * A_c + -6.000000000000005 * A_c * B_c + -6.000000000000004 * B_c * A_c + 10.000000000000009 * B_c * B_c;

                std::any lScore = -lMahalanobis -(-2.0000000000000018);


                tTable lTable;

                lTable["AnomalyScore"] = { lScore } ;
                lTable["OutlierIndicator"] = { ( lScore >= 0.0 ) ? 1 : -1 } ;

                return lTable;
        }

        tTable compute_model_outputs_from_table( tTable const & iTable) {
                tTable lTable = compute_outlier_scores(iTable.at("A")[0], iTable.at("B")[0]);

                return lTable;
        }

} // eof namespace