st is a small tool for doing data science work at the command
line. One core goal is to adhere to Unix principles regarding
input/output. This tool is intended for use in Unix pipelines.
I spend a lot of time at the command line doing DS work, and this tool largely replaces many of the simple scripts I used to use.
- Installing
- Usage
- Summary statistics
- K-Quintiles
- Model Evaluation
- XGBoost
- Correlation Matrix
- Extract Features
Building locally requires the rust tool chain (
When using cargo to install, make sure ~/.cargo/bin is in your $PATH.
> # install rustup toolchain
> # haven't confirmed, I think you'll need
> brew install libomp
> cargo install --path .
> # install rustup toolchain
> # this is needed for xgboost
> sudo apt-get install llvm-dev libclang-dev clang
> cargo install --path .
> st --help
st 0.1
stat information and processing
-h, --help Prints help information
-V, --version Prints version information
cor-matrix Computes the Pearson correlation coefficient
eval evaluation metrics to score an output, confusion matrix and other helpful probablities. Note: all
classes need to be 0..N
extract data transformations and feature generation tools
graph very simple cli graphing
help Prints this message or the help of the given subcommand(s)
quintiles k-quintile from a single vector (default k = 5)
summary summary statistics from a single vector
xgb train, predict, and understand xgboost models
--help works after any subcommand and displays that subcommand's info, flags, and options.
> cat tests/iris.csv | awk -F',' '{print $1}' |st summary -h
n min max mean median mode sd var
150 4.3 7.9 5.8433332 5.8 5 0.8253013 0.68112224
Or transpose the output.
> cat tests/iris.csv | awk -F',' '{print $1}' |st summary -ht
N 150
min 4.3
max 7.9
mean 5.8433332
med 5.8
mode 5
stdev 0.8253013
var 0.68112224
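For readers who want to sanity-check these numbers outside of st, here is a minimal Python sketch of the same statistics. It is not st's implementation. In particular, whether st divides by n or n-1 for the variance is not stated here, so the sketch assumes the population (divide-by-n) form; the function name `summarize` is made up for illustration.

```python
import statistics

def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        # Assumption: population (divide-by-n) standard deviation and variance.
        "sd": statistics.pstdev(values),
        "var": statistics.pvariance(values),
    }

stats = summarize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
for name, value in stats.items():
    print(name, value)
```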
The quintiles subcommand is a simple way to get k-quintiles. The default is k = 5; use the -k flag to set a user-defined k.
> cat tests/iris.csv | awk -F',' '{print $1}' |st quintiles -h -k 10
10% 4.8
20% 5
30% 5.3
40% 5.6
50% 5.8
60% 6.1
70% 6.3
80% 6.6
90% 6.9
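There are several conventions for computing cut points like these. The sketch below uses the nearest-rank method, which is an assumption on my part and may differ from what st does (for example, it does no interpolation). The name `k_quantiles` is hypothetical.

```python
def k_quantiles(values, k=5):
    """Return the k-1 cut points of `values` using the nearest-rank method."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, k):
        # 1-based nearest rank: ceil(i * n / k), computed with integers.
        rank = (i * n + k - 1) // k
        cuts.append((f"{100 * i // k}%", ordered[rank - 1]))
    return cuts

cuts = k_quantiles(list(range(1, 101)), k=10)
for label, value in cuts:
    print(label, value)
```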
Model evaluation is super important, and this subcommand contains some common tools for understanding your model.
Note: All classes are assumed to be labeled 0..N. Every calculation in this section assumes labels that start at 0 and count up. Ensure your data is in this format or face the panics.
Additionally, all data passed into this subcommand is expected to be a
list of line-separated tuples of the form predicted, actual.
Again, actual must be 0..N. Predicted in this case is an int or a (0,1)
value. In the case of a (0,1) value, it is rounded at 0.5 up or down
to the nearest int. To specify the threshold, use the -t flag.
> st eval iris_results.csv
- 0 1 2
0 7 1 0
1 0 9 0
2 0 0 8
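To make the input format concrete, here is a rough sketch of how a confusion matrix can be tallied from predicted, actual pairs, including the thresholding of (0,1) scores. The orientation (rows as actual classes, columns as predicted) is an assumption, as is the function name; st's own layout may differ.

```python
def confusion_matrix(pairs, n_classes, threshold=0.5):
    """Count (predicted, actual) pairs; rows are actual, columns are predicted."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for predicted, actual in pairs:
        # A score strictly inside (0, 1) is mapped to class 0 or 1 at the threshold.
        if 0.0 < predicted < 1.0:
            predicted = 1 if predicted >= threshold else 0
        matrix[int(actual)][int(predicted)] += 1
    return matrix

pairs = [(0, 0), (0.9, 1), (0.2, 1), (2, 2), (1, 1), (0.7, 0)]
matrix = confusion_matrix(pairs, n_classes=3)
for row in matrix:
    print(row)
```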
There is a -v flag which will provide the TPR, FPR, TNR, and FNR rates for each class.
> st eval -v iris_results.csv
- 2 1 0
2 7 1 0
1 0 9 0
0 0 0 8
class tpr fpr tnr fnr
0 0.875 0.000 0.944 0.125
1 1.000 0.062 1.000 0.000
2 1.000 0.000 1.000 0.000
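The per-class rates are one-vs-rest quantities derived from the confusion matrix. The sketch below uses the textbook definitions and assumes rows are actual classes; it is an illustration, not st's code, and its numbers may not agree with st's output in every column.

```python
def class_rates(matrix):
    """One-vs-rest TPR, FPR, TNR and FNR per class; rows are actual classes."""
    total = sum(sum(row) for row in matrix)
    rates = {}
    for c in range(len(matrix)):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                  # actual c, predicted something else
        fp = sum(row[c] for row in matrix) - tp   # predicted c, actually something else
        tn = total - tp - fn - fp
        rates[c] = {
            "tpr": tp / (tp + fn) if tp + fn else 0.0,
            "fpr": fp / (fp + tn) if fp + tn else 0.0,
            "tnr": tn / (tn + fp) if tn + fp else 0.0,
            "fnr": fn / (fn + tp) if fn + tp else 0.0,
        }
    return rates

# The 3x3 matrix from the first eval example.
rates = class_rates([[7, 1, 0], [0, 9, 0], [0, 0, 8]])
for label, r in rates.items():
    print(label, r)
```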
For binary and softmax objective functions, there is also a Bayes estimator of the model's effective performance. This requires passing in a list of base rates of occurrence, one per class. The length of the list must match the number of classes.
Apply Bayes' formula, given a natural rate of occurrence for the target class. In the example below, the natural rate of class_1 is very low. This answers the question: if the model predicts class_N, what is the probability that the input really is of class_N?
> st eval -v -b '0.99,0.01' results.csv
- 1 0
1 313 7
0 42 338
class tpr fpr tnr fnr
0 0.978 0.111 0.980 0.022
1 0.889 0.022 0.882 0.111
0: Pr(class_0 | positive) = 0.99885994
1: Pr(class_1 | positive) = 0.291144
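The calculation behind these posteriors can be sketched with Bayes' rule applied to a class's TPR, FPR, and base rate. This is my reading of the output above rather than a confirmed description of st's internals; the function name `posterior` is hypothetical. Plugging in the rounded rates from the table gives values close to the ones printed (small differences come from rounding).

```python
def posterior(tpr, fpr, base_rate):
    """Pr(class | predicted as class) via Bayes' rule."""
    hit = tpr * base_rate                   # true member, predicted as member
    false_alarm = fpr * (1.0 - base_rate)   # non-member, predicted as member
    return hit / (hit + false_alarm)

# Rounded rates taken from the table above, base rates from -b '0.99,0.01'.
p0 = posterior(tpr=0.978, fpr=0.111, base_rate=0.99)
p1 = posterior(tpr=0.889, fpr=0.022, base_rate=0.01)
print(p0, p1)
```

Even with a TPR near 0.89, a 1% base rate means well under a third of class_1 predictions are correct, which is the point of passing base rates in.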
A ROC curve in table form. For this, the expected input is a list of
tuples of prediction, actual
where prediction is a score in the range (0,1).
> st eval --table results.csv
- 1 0
1 57492 2465
0 2508 57535
t prec f1 tpr fpr
0.05 0.7785 0.9975 0.8745 0.2837
0.10 0.8343 0.9944 0.9074 0.1974
0.15 0.8688 0.9910 0.9259 0.1497
0.20 0.8922 0.9876 0.9375 0.1193
0.25 0.9087 0.9843 0.9449 0.0989
0.30 0.9215 0.9797 0.9498 0.0834
0.35 0.9330 0.9748 0.9534 0.0701
0.40 0.9429 0.9695 0.9560 0.0587
0.45 0.9516 0.9641 0.9578 0.0491
0.50 0.9589 0.9582 0.9585 0.0411
0.55 0.9653 0.9521 0.9587 0.0342
0.60 0.9714 0.9442 0.9576 0.0278
0.65 0.9760 0.9355 0.9553 0.0230
0.70 0.9808 0.9246 0.9519 0.0181
0.75 0.9852 0.9108 0.9465 0.0137
0.80 0.9898 0.8916 0.9381 0.0092
0.85 0.9935 0.8628 0.9235 0.0057
0.90 0.9962 0.8167 0.8976 0.0032
0.95 0.9985 0.7174 0.8349 0.0011
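A table like this comes from sweeping a decision threshold over the scores and recomputing the metrics at each step. The sketch below shows the idea with the standard definitions of precision, recall (TPR), F1, and FPR. The function name and the column order of the returned tuples are my own choices and are not taken from st.

```python
def threshold_table(pairs, thresholds):
    """For each threshold return (t, precision, recall, f1, fpr)."""
    rows = []
    for t in thresholds:
        tp = fp = tn = fn = 0
        for score, actual in pairs:
            predicted = 1 if score >= t else 0
            if predicted == 1 and actual == 1:
                tp += 1
            elif predicted == 1:
                fp += 1
            elif actual == 1:
                fn += 1
            else:
                tn += 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        rows.append((t, precision, recall, f1, fpr))
    return rows

pairs = [(0.9, 1), (0.8, 1), (0.6, 0), (0.4, 1), (0.3, 0), (0.1, 0)]
table = threshold_table(pairs, [0.25, 0.5, 0.75])
for row in table:
    print(" ".join(f"{v:.4f}" for v in row))
```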
XGBoost is built into st. A simple workflow with the iris dataset is below.
XGBoost is usually the first model I start with when analyzing a dataset. Even
if XGBoost isn't the final model I'll be using in production, it is super easy
to train and, most of all, to interpret (well, trees in general are).
This is not a binary prediction: there are three classes in this set, so we need to ensure we're using a multi-class objective. The following command lists the distinct labels.
> cat tests/iris.csv |sed -e '1,1d' |tr -d '"' | awk -F',' '{print $5}' |sort |uniq
The following command does a number of things. The first sed
command strips off the
header line, as that gets in the way of the final random shuffle.
The tr command removes the double quotes. The next few sed
commands change the string labels to integers that XGBoost
can understand. We are left with three categories encoded as 0, 1, and 2. The
final command shuffles the entire dataset and prepares it for the train/test split.
> cat tests/iris.csv |sed -e '1,1d' |tr -d '"' | sed -e 's/setosa/0/g' | sed -e 's/versicolor/1/g' | sed -e 's/virginica/2/g' | sort -R > tests/iris_normalized.csv
> head tests/iris_normalized.csv
Split the dataset into a training set and a testing set. Since the dataset has been shuffled, we can just take the first N lines as the test set and the remainder as the training set.
> head -n 25 tests/iris_normalized.csv > tests/iris_test.csv
> cat tests/iris_normalized.csv | sed -e '1,25d' > tests/iris_train.csv
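If you would rather do the same preparation in a script, the steps above (encode the labels, shuffle, take the first N rows as the test set) can be sketched in a few lines of Python. This is only an equivalent illustration on a handful of in-memory rows. The function name `prepare` and the fixed seed are assumptions for reproducibility, and the shell pipeline above does not use a seed.

```python
import random

LABELS = {"setosa": 0, "versicolor": 1, "virginica": 2}

def prepare(rows, test_size, seed=0):
    """Encode the label column, shuffle, and split into (test, train)."""
    encoded = [row[:-1] + [LABELS[row[-1]]] for row in rows]
    random.Random(seed).shuffle(encoded)
    return encoded[:test_size], encoded[test_size:]

# A few illustrative rows in the same column layout as the iris CSV.
rows = [
    [5.1, 3.5, 1.4, 0.2, "setosa"],
    [7.0, 3.2, 4.7, 1.4, "versicolor"],
    [6.3, 3.3, 6.0, 2.5, "virginica"],
    [4.9, 3.0, 1.4, 0.2, "setosa"],
]
test, train = prepare(rows, test_size=1)
print(len(test), len(train))
```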
Now that the data is cleaned, we can train the model with XGBoost using all default parameters. The -y flag indicates which column holds the target (label) value. After the model is trained and saved, we can use it on our test set.
Training parameters can be tuned, such as eta and max depth. See st xgb train --help
for more options.
> cat tests/iris_train.csv | st xgb train -n 3 -y 4 -m out.model -o multi:softmax
merror = 0.008
Now we can use the model to predict some values. Get the test set and use the predict subcommand. The predicted value for the test set is added as the first column of the output.
> cat tests/iris_test.csv | st xgb predict -m out.model
Tree based models are great for understanding the results. You can use the "importance" subcommand to try and understand the model and how specific features impact the model. All features are labeled "fx" where "x" is the column number.
> st xgb importance -t gain out.model
f2 = 0.49456635
f3 = 0.4888009
f1 = 0.011651897
f0 = 0.004981032
Computes the Pearson correlation coefficient matrix. In the example
below, the -y flag is used because the CSV file still contains the
string labels in column 4.
> st cor-matrix -y 4 tests/iris_cleaned.csv
- 0 1 2 3
0: 1.00
1: -0.12 1.00
2: 0.87 -0.43 1.00
3: 0.82 -0.37 0.96 1.00
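For reference, each entry is a plain pairwise Pearson coefficient. Here is a small pure-Python sketch that builds the same lower-triangular layout from a list of columns. It is an illustration of the formula, not st's implementation, and the function names are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Lower-triangular matrix of pairwise coefficients."""
    return [[pearson(columns[i], columns[j]) for j in range(i + 1)]
            for i in range(len(columns))]

columns = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
matrix = correlation_matrix(columns)
for i, row in enumerate(matrix):
    print(f"{i}:", " ".join(f"{v:.2f}" for v in row))
```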
Frequently, a normalized byte histogram is desired from some input. This outputs a series of 256 values to stdout, where each index is the decimal value of the corresponding byte.
> cat | st extract byte-histogram
0,0,0,0,0,0,0,0,0,0,0.027048063825647013, ...
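A normalized byte histogram is simple to reproduce. The sketch below counts each byte value and divides by the total number of bytes. Normalizing by the total is an assumption, since st could normalize differently, and the function name is hypothetical.

```python
def byte_histogram(data: bytes):
    """Return 256 relative frequencies, one per byte value 0..255."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = len(data)
    # Assumption: normalize by the total byte count so the values sum to 1.
    return [c / total if total else 0.0 for c in counts]

hist = byte_histogram(b"abca")
print(",".join(str(v) for v in hist[95:100]))
```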
Dimensionality reduction using the hashing trick is built in under the extract subcommand. Use the -F flag to set the delimiter.
> echo 'foo,bar,baz,raw,norm,etc' | st extract hash-trick -k 10 -b
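If the hashing trick is new to you, the idea is to hash each token into one of k buckets so that any number of distinct tokens maps to a fixed-length vector. The sketch below is a generic version of that idea. The hash function (MD5 here, chosen only because it is deterministic across runs), the `binary` option, and the function name are all assumptions and are not meant to describe what st's -k and -b flags do.

```python
import hashlib

def hash_trick(tokens, k, binary=False):
    """Map tokens into a fixed-length vector of k buckets."""
    vector = [0] * k
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % k
        # Hypothetical option: record presence only instead of counts.
        vector[index] = 1 if binary else vector[index] + 1
    return vector

tokens = "foo,bar,baz,raw,norm,etc".split(",")
vector = hash_trick(tokens, k=10)
print(vector)
```

Collisions are expected: two tokens can land in the same bucket, which is the price paid for the fixed size.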
Sometimes it is useful to use bit entropy as a feature.
> st extract entropy /bin/bash
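For context, a common definition of this feature is the Shannon entropy of the byte distribution, measured in bits per byte (0 for a constant stream, 8 for uniformly distributed bytes). The sketch below assumes that definition; st's exact calculation is not shown here, and the function name is hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(data).values())

print(shannon_entropy(b"ab" * 8))
print(shannon_entropy(bytes(range(256))))
```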