This is a collection of tools for analyzing the performance of complex algorithms. The tools use machine learning and statistics on real historical data to determine performance bottlenecks and the relationships between parameters and runtime. In the future we want to help determine appropriate hardware constraints (e.g. memory) for efficient use of resources, and we also hope to build a tool for reliable runtime prediction. It is made with Galaxy Project admins in mind. With these tools you can:
- collect historical data from your Galaxy database
- manipulate the data
- determine which parameters affect the runtime of the algorithm
- inspect the impact of a single parameter on runtime
- determine an appropriate run walltime for tools
- determine appropriate memory allocation
- determine an appropriate number of processor cores
Use get_tool_run_info.py to collect data from your Galaxy database. The supported output file formats are CSV and JSON.
There is also an example data file available: bwa_mem_0.7.15.1_example.csv
options | default | description |
---|---|---|
--config | default="config/galaxy.yml" | path to your Galaxy config file |
--toolid | required | tool id (e.g. "toolshed.g2.bx.psu.edu/repos/devteam/tophat2/tophat2/0.9") |
--outfile | default="job_data.csv" | output file; the extension determines the format (.csv or .json) |
The best input format for the downstream tools is comma-separated values (CSV).
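Once collected, the data can be loaded with pandas for a first look. A minimal sketch (job_data.csv is the default output name from above):

```python
import pandas as pd

# Load the collected job data (job_data.csv is the default output name).
df = pd.read_csv("job_data.csv")

# Get a quick feel for the dataset: size, column types, summary statistics.
print(df.shape)
print(df.dtypes)
print(df.describe())
```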
Use [csv_file_manipulation.ipynb](csv_file_manipulation.ipynb) for suggestions on how to view and manipulate CSV data with Python. It has examples of inspecting, deleting, transforming, and combining data.
A full example of data manipulation can be found at [finding_outliers.ipynb](finding_outliers.ipynb).
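As a rough sketch of the kind of cleaning covered there (assuming a "runtime" column as the tools below expect; the three-standard-deviation cutoff is an arbitrary illustration, not the notebook's exact method):

```python
import pandas as pd

df = pd.read_csv("job_data.csv")

# Drop jobs with missing runtimes.
df = df.dropna(subset=["runtime"])

# Flag runtimes more than three standard deviations from the mean as outliers.
mean, std = df["runtime"].mean(), df["runtime"].std()
is_outlier = (df["runtime"] - mean).abs() > 3 * std
print(f"{is_outlier.sum()} potential outliers out of {len(df)} jobs")

# Keep the non-outlier rows and save the cleaned dataset.
df[~is_outlier].to_csv("job_data_clean.csv", index=False)
```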
This tool (feature_importances_with_random_forests.py) estimates the relative impact of parameters on the runtime of tools. It does so by fitting a Random Forest Regressor to a historical dataset (of parameters and runtimes) and determining the Mean Decrease Impurity of each parameter. The Mean Decrease Impurity is an estimate of how much the Random Forest uses the parameter in its decisions.
The tool accepts a .tsv or .csv file; the example file bwa_mem_0.7.15.1_example.csv mentioned above can serve as a sample. The file should have one column labeled "runtime", which the Random Forest will treat as the dependent variable to predict.
The tool will warn you if you have a parameter with more than 30 categories, or if you have a parameter that is monotonic (such as an id or a constant number).
The tool should take less than a minute to finish. It saves the feature importances in a .tsv file, and optionally saves a plot to a .png file.
options | default | description |
---|---|---|
--filename | required | the name of the .csv or .tsv file with the data |
--outfile | default="feature_importances.tsv" | name of a .tsv file where you want the output |
--plot_outfile | default=None | If you want a plot, use this to name the .png file you want it saved to. Otherwise, leave as default. |
--runtime_label | default="runtime" | this specifies the label of the variable to predict in your dataset |
--unite_categorical_features | default=True | whether to give the importance of each categorical feature as one number, or to give the importance of each separate category (e.g. give importance of "color" vs. give importance of "color_blue", "color_green", "color_yellow") |
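The core of this approach can be sketched in a few lines of scikit-learn. This is an illustration of Mean Decrease Impurity, not the script itself; the one-hot encoding mirrors the per-category importances that unite_categorical_features would re-aggregate:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("job_data.csv")

# One-hot encode categorical parameters
# (e.g. "color" becomes "color_blue", "color_green", "color_yellow").
X = pd.get_dummies(df.drop(columns=["runtime"]))
y = df["runtime"]

# Fit a Random Forest and read off the Mean Decrease Impurity of each feature.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```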
Use single_feature_analysis.py to inspect the effect of a single parameter on runtime while holding all of the other parameters constant.
The tool accepts a .tsv or .csv file, and it requires a feature_of_interest and a runtime column to be named. It sorts the dataset into sets of jobs that all have similar parameters. Then it saves the parameters of the largest sets of jobs in files named parameters_i.tsv and saves plots of feature_of_interest vs. runtime to plot_i.png.
For example, say you put in a csv file like this:
runtime | feature_of_interest | category | number |
---|---|---|---|
5 | 1 | True | 0 |
10 | 5 | False | 0 |
2 | 3 | True | 7 |
3 | 2 | True | 6 |
4 | 1 | False | 25 |
First, the continuous columns will be placed into bins, like so:
runtime | feature_of_interest | category | number |
---|---|---|---|
5 | 1 | True | 0 |
10 | 5 | False | 0 |
2 | 3 | True | (5, 7.5] |
3 | 2 | True | (5, 7.5] |
4 | 1 | False | (23.5, 25] |
The largest set with equivalent parameters is chosen:
runtime | feature_of_interest | category | number |
---|---|---|---|
2 | 3 | True | (5, 7.5] |
3 | 2 | True | (5, 7.5] |
The values of the equivalent parameters are saved to parameters_i.tsv, and the plot of runtime vs. feature_of_interest is saved to plot_i.png. You can choose to inspect the n largest sets with the option --num_to_plot.
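The binning and grouping steps can be approximated with pandas. A sketch using the example table above (the exact number of bins the script uses may differ):

```python
import pandas as pd

# The example table from above.
df = pd.DataFrame({
    "runtime": [5, 10, 2, 3, 4],
    "feature_of_interest": [1, 5, 3, 2, 1],
    "category": [True, False, True, True, False],
    "number": [0, 0, 7, 6, 25],
})

# Bin the continuous "number" column into intervals such as (5.0, 7.5].
df["number"] = pd.cut(df["number"], bins=10)

# Group jobs by every column except runtime and the feature of interest,
# then pick the largest group of jobs with equivalent parameters.
groups = df.groupby(["category", "number"], observed=True)
largest = max(groups, key=lambda kv: len(kv[1]))[1]
print(largest[["runtime", "feature_of_interest"]])
```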
options | default | description |
---|---|---|
--filename | required | the name of the .csv or .tsv file with the data |
--feature_of_interest | required | name of the parameter to inspect |
--runtime_label | default="runtime" | this specifies the label of the variable to predict in your dataset |
--num_to_plot | default=1 | number of parameter sets to examine |
--outdir | default='single_feature' | the name of the directory the output files should go in |
Once you are ready to build a random forest predictor, train_model.py will train a random forest and save the model in a pickle file.
options | default | description |
---|---|---|
--filename | required | the name of the .csv or .tsv file with the data |
--model_outfile | default='model.pkl' | name of a .pkl file where you want the output |
--plot_outfile | default='plot.png' | If you want a plot, use this to name the .png file you want it saved to. Otherwise, leave as default. |
--runtime_label | default="runtime" | this specifies the label of the variable to predict in your dataset |
--split_train_test | default=False | if you are making a plot, whether to split the dataset into a training and a testing set, or to use the whole dataset for both training and testing |
--split_randomly | default=False | if split_train_test == True, this specifies whether to split the data randomly |
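A stripped-down version of what train_model.py does might look like this (a sketch, not the script itself; the encoding and plotting details are simplified):

```python
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("job_data.csv")
X = pd.get_dummies(df.drop(columns=["runtime"]))
y = df["runtime"]

# Hold out a test set (mirrors --split_train_test with --split_randomly).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))

# Persist the trained model to a pickle file.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```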
Then use model.pkl to predict the runtime of new jobs with predict_runtime_with_model.py.
options | default | description |
---|---|---|
--filename | required | the name of the .csv or .tsv file with the data |
--model_filename | required | the name of the .pkl of the model |
--plot_outfile | default='plot.png' | To get a plot you must also provide a runtime_label |
--runtime_label | default="runtime" | the label of the runtime column in your dataset; may be set to None if you don't want prediction metrics |
--single_prediction | default=False | if you want a single prediction, setting this to True will return the prediction and nothing else |
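Loading the pickled model to predict runtimes for new jobs could then look like this (a sketch; new_jobs.csv is a hypothetical file whose encoded columns must match the training data):

```python
import pickle

import pandas as pd

# Load the model saved by train_model.py.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# New jobs to predict; they must be encoded into the same columns
# the model was trained on.
new_jobs = pd.get_dummies(pd.read_csv("new_jobs.csv"))

print(model.predict(new_jobs))
```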