Algorithm Performance Analysis

This is a collection of tools for analyzing the performance of complex algorithms. The tools apply machine learning and statistics to real historical data to identify performance bottlenecks and the relationships between parameters and runtime. In the future we want to help determine appropriate hardware constraints (e.g. memory) for efficient use of resources, and we hope to build a tool for reliable runtime prediction. It is made with Galaxy Project admins in mind.

What this can do:

  • collect historical data from your Galaxy database
  • manipulate the data
  • determine which parameters affect the runtime of the algorithm
  • inspect the impact of a single parameter on runtime

Future plans:

  • determine an appropriate run walltime for tools
  • determine appropriate memory allocation
  • determine an appropriate number of processor cores

Collect Data From Galaxy Database

Use get_tool_run_info.py to collect data from your Galaxy database. The supported output file formats are CSV and JSON.

There is also an example data file available: bwa_mem_0.7.15.1_example.csv

| option | default | description |
| --- | --- | --- |
| `--config` | `"config/galaxy.yml"` | the config file of your Galaxy instance |
| `--toolid` | required | tool id (e.g. `"toolshed.g2.bx.psu.edu/repos/devteam/tophat2/tophat2/0.9"`) |
| `--outfile` | `"job_data.csv"` | output file; the extension determines the format |
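For example, to collect the run history of a specific tool into a CSV file (the tool id below is the example from the table; adjust the paths to your own setup):

```sh
python get_tool_run_info.py \
    --config config/galaxy.yml \
    --toolid "toolshed.g2.bx.psu.edu/repos/devteam/tophat2/tophat2/0.9" \
    --outfile job_data.csv
```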

Best input format for current and future analysis

The best input format to use is comma-separated values (CSV).
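For illustration, a well-formed input file might look like this (all column names except "runtime" are hypothetical; see the example data file above for real data):

```csv
runtime,input_size,algorithm,threads
5.2,1200,mem,4
10.8,5400,mem,8
2.1,300,aln,4
```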

Manipulate Data

Use [csv_file_manipulation.ipynb](csv_file_manipulation.ipynb) for suggestions on how to view and manipulate CSV data with Python. It includes examples of inspecting, deleting, transforming, and combining data.
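As a rough sketch of the kinds of operations the notebook covers (assuming the data is loaded with pandas, and using hypothetical column names):

```python
import pandas as pd

# Inspect: load the collected job data and look at it.
df = pd.read_csv("job_data.csv")
print(df.describe())
print(df.head())

# Delete: drop rows with missing runtimes, and drop an unwanted column.
df = df.dropna(subset=["runtime"])
df = df.drop(columns=["unneeded_column"])  # hypothetical column name

# Transform: convert a size column from bytes to megabytes.
df["input_size_mb"] = df["input_size"] / 1e6  # hypothetical column name

# Combine: append data collected from a second run of the collector.
other = pd.read_csv("more_job_data.csv")
df = pd.concat([df, other], ignore_index=True)

df.to_csv("job_data_clean.csv", index=False)
```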

A full example of data manipulation can be found in [finding_outliers.ipynb](finding_outliers.ipynb).

Feature Importances with Random Forests

This tool (feature_importances_with_random_forests.py) estimates the relative impact of parameters on the runtime of tools. It does so by fitting a Random Forest Regressor to a historical dataset (of parameters and runtimes) and computing the Mean Decrease Impurity of each parameter. The Mean Decrease Impurity estimates how much the Random Forest relies on the parameter in its decisions.

The tool accepts a .tsv or .csv file; the example data file above is a sample. The file should have one column labeled "runtime", which the Random Forest will treat as the dependent variable to predict.
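To make the method concrete, here is a minimal sketch of the underlying computation with scikit-learn, whose impurity-based `feature_importances_` are exactly the Mean Decrease Impurity (the one-hot encoding step is an assumption here and may differ from the script's exact preprocessing):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("bwa_mem_0.7.15.1_example.csv")
y = df["runtime"]
# One-hot encode categorical parameters so the forest can use them.
X = pd.get_dummies(df.drop(columns=["runtime"]))

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Mean Decrease Impurity of each column, normalized to sum to 1.
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```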

The tool will warn you if a parameter has more than 30 categories, or if a parameter is monotonic (such as an id or a constant number).

The tool should take less than a minute to finish. It saves the important features to a .tsv file, and can optionally save a plot to a .png file.

| option | default | description |
| --- | --- | --- |
| `--filename` | required | the name of the .csv or .tsv file with the data |
| `--outfile` | `"feature_importances.tsv"` | the name of the .tsv file for the output |
| `--plot_outfile` | `None` | if you want a plot, the name of the .png file to save it to; otherwise leave as the default |
| `--runtime_label` | `"runtime"` | the label of the variable to predict in your dataset |
| `--unite_categorical_features` | `True` | whether to report the importance of each categorical feature as one number, or to report the importance of each separate category (e.g. the importance of "color" vs. the importances of "color_blue", "color_green", "color_yellow") |
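A typical invocation, using the example data file above, might look like:

```sh
python feature_importances_with_random_forests.py \
    --filename bwa_mem_0.7.15.1_example.csv \
    --outfile feature_importances.tsv \
    --plot_outfile feature_importances.png
```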

Inspect a Single Feature

Use single_feature_analysis.py to inspect the effect of a single parameter on runtime while holding all of the other parameters constant.

The tool accepts a .tsv or .csv file, and it requires a feature_of_interest and a runtime column to be named. It sorts the dataset into sets of jobs that all have similar parameters, then saves the parameters of the largest sets to files named parameters_i.tsv and plots of feature_of_interest vs. runtime to files named plot_i.png.

For example, say you pass in a CSV file like this:

| runtime | feature_of_interest | category | number |
| --- | --- | --- | --- |
| 5 | 1 | True | 0 |
| 10 | 5 | False | 0 |
| 2 | 3 | True | 7 |
| 3 | 2 | True | 6 |
| 4 | 1 | False | 25 |

First, the continuous columns are placed into bins, like so:

| runtime | feature_of_interest | category | number |
| --- | --- | --- | --- |
| 5 | 1 | True | 0 |
| 10 | 5 | False | 0 |
| 2 | 3 | True | (5, 7.5] |
| 3 | 2 | True | (5, 7.5] |
| 4 | 1 | False | (23.5, 25] |
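Interval labels such as (5, 7.5] are half-open bins of the kind produced by pandas.cut; here is a small sketch of that style of binning (the tool's actual bin edges and bin count may differ):

```python
import pandas as pd

number = pd.Series([0, 0, 7, 6, 25])
# Each value is replaced by the equal-width interval it falls into,
# labeled like (5.0, 7.5].
print(pd.cut(number, bins=10))
```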

The largest set with equivalent parameters is chosen:

| runtime | feature_of_interest | category | number |
| --- | --- | --- | --- |
| 2 | 3 | True | (5, 7.5] |
| 3 | 2 | True | (5, 7.5] |

The values of the equivalent parameters are saved to parameters_i.tsv, and the plot of runtime vs. feature_of_interest is saved to plot_i.png. You can choose to inspect the n largest sets with the option --num_to_plot.

| option | default | description |
| --- | --- | --- |
| `--filename` | required | the name of the .csv or .tsv file with the data |
| `--feature_of_interest` | required | the name of the parameter to inspect |
| `--runtime_label` | `"runtime"` | the label of the variable to predict in your dataset |
| `--num_to_plot` | `1` | the number of parameter sets to examine |
| `--outdir` | `'single_feature'` | the name of the directory the output files should go in |
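For example, to inspect the three largest parameter sets (the feature_of_interest value below is a hypothetical column name):

```sh
python single_feature_analysis.py \
    --filename bwa_mem_0.7.15.1_example.csv \
    --feature_of_interest input_size \
    --num_to_plot 3
```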

Predicting

Once you are ready to build a random forest predictor, train_model.py will train a random forest and save the model to a pickle file.

| option | default | description |
| --- | --- | --- |
| `--filename` | required | the name of the .csv or .tsv file with the data |
| `--model_outfile` | `'model.pkl'` | the name of the .pkl file for the output |
| `--plot_outfile` | `'plot.png'` | if you want a plot, the name of the .png file to save it to; otherwise leave as the default |
| `--runtime_label` | `"runtime"` | the label of the variable to predict in your dataset |
| `--split_train_test` | `False` | when making a plot, whether to split the dataset into a training set and a testing set, or to use the whole dataset for both training and testing |
| `--split_randomly` | `False` | if split_train_test is True, whether to split the data randomly |
Then use model.pkl to predict the runtime of new instances with predict_runtime_with_model.py.

| option | default | description |
| --- | --- | --- |
| `--filename` | required | the name of the .csv or .tsv file with the data |
| `--model_filename` | required | the name of the .pkl file of the model |
| `--plot_outfile` | `'plot.png'` | to get a plot you must also provide a runtime_label |
| `--runtime_label` | `"runtime"` | the label of the variable to predict in your dataset; may be set to None if you don't want prediction metrics |
| `--single_prediction` | `False` | if you want a single prediction, setting this to True returns the prediction and nothing else |
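For example, to predict runtimes for new jobs with the saved model (new_jobs.csv is a hypothetical file of new parameter sets):

```sh
python predict_runtime_with_model.py \
    --filename new_jobs.csv \
    --model_filename model.pkl
```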