Cyobstract is a cyber observables extraction tool that uses regular expressions on cyber incident reports. It quickly pulls indicators and other cyber information from these reports. It takes free text as input and provides relevant information for incident response (IR) in a structured format as output.
Cyobstract is an IR tool built to support an exploratory inquiry that the CERT Division of the Software Engineering Institute (SEI) performed on a dataset of Department of Homeland Security (DHS) incident reports.
You can feed Cyobstract with the text of your own incident reports—either from an extract of incident reports from a ticketing system or another source (e.g., email, text files, a database, etc.).
Cyobstract applies a list of regular expressions to catch commonly occurring data types and values inside of free text. It also matches on many styles of defanged indicators. (Defanging obfuscates indicators into safer representations so that a user doesn't accidentally click on a malicious URL or inadvertently run malicious code.)
There are two parts to Cyobstract:
-
the Cyobstract standard package, which can be installed and imported in Python and then called with
cyobstract.extract
and cyobstract.trie
-
the suite of developer tools that can be used to build a custom extraction module
The developer tools are covered in the Using the Cyobstract Developer Tools section below.
Important: Cyobstract is a beta project that makes no guarantees regarding stability or functionality.
There are two main features of Cyobstract: extracting indicators/observables and handling defanged indicators.
Cyobstract extracts 24 incident-response-related data types from free text incident reports, such as the commonly used 'incident_description' field:
- IP addresses—IPv4, IPv4 CIDR, IPv4 range, IPv6, IPv6 CIDR, and IPv6 range
- hashes—MD5, SHA1, SHA256, and ssdeep
- Internet and system-related strings—FQDN, URL, user agent strings, email address, filenames, filepath, and registry key
- Internet infrastructure values—ASN, ASN owner, country, and ISP
- security analysis values—CVE, malware, and attack type
Cyber incident responders deal with malicious computer code and, by necessity, they use computers to analyze those samples. It is difficult to simultaneously maintain a safe environment, a high functionality environment, and an environment that allows the rapid communication of results. The IR community adopted a practice called defanging to reduce the chances of accidentally infecting their own (or others’) computers.
Defanging obfuscates indicators into safer representations so that a user doesn't accidentally click on a malicious URL or inadvertently run malicious code. Defanging means changing data values so that they deliberately violate Internet protocols. Unfortunately, there is no universal standard for defanging, although there are some common methods. There is even a Python module that can be used to defang certain data types, but not all teams use it.
Typical types of defanged data include IP addresses, fully qualified domain names (FQDNs), email, and file extensions. Some samples of defanging we have observed include the following:
- www dot cert dot org (www.cert.org)
- www[.]cert[.]org
- www[.cert[.org
- www{.}cert{.}org
- incidents at cert dot org (incidents@cert.org)
Our extraction module successfully recognizes and extracts many forms of defanged indicator values.
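To make the defanging styles listed above concrete, here is a minimal sketch that reverses them for a single indicator string. It is not Cyobstract's implementation: the function name `refang` and both patterns are assumptions made for illustration, and they cover only the sample styles shown in this section.

```python
import re

# Hypothetical patterns for illustration; Cyobstract's own expressions
# are more extensive and may differ.
_DOT = re.compile(r"\s+dot\s+|[\[{(]\.[\]})]?", re.IGNORECASE)
_AT = re.compile(r"\s+at\s+|[\[{(]@[\]})]", re.IGNORECASE)


def refang(value):
    """Undo a few common defanging styles in one indicator string."""
    value = _DOT.sub(".", value)   # "www[.]cert[.]org" -> "www.cert.org"
    return _AT.sub("@", value)     # "incidents at cert.org" -> "incidents@cert.org"


for sample in ("www dot cert dot org", "www[.cert[.org",
               "www{.}cert{.}org", "incidents at cert dot org"):
    print(sample, "->", refang(sample))
```

This sketch is meant for strings already known to be indicators. Applied to whole sentences, it would also rewrite ordinary uses of "at" and "dot", which is one reason extraction from free text needs more careful expressions.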
To use the prepackaged modules, follow the installation instructions below to install the Cyobstract standard package. Details on how to use the modules and what they do are also provided below.
To use the Cyobstract developer tools, clone the source repository and use them directly from that location. To use the tools, you must set up a configuration file that specifies your data set of incident reports (or tickets). Configuration details are in the Configuration subsection of the Using the Cyobstract Developer Tools section below.
Using either the standard package or the developer tools, Cyobstract uses a set of regular expressions to extract 24 information types from the free text in your incident reports. Using the standard package, the extracted fields are returned to your application. Using the developer tools, the extracted fields are stored to disk while being used.
Install Python 2.7 or 3.6.
Clone the repository:
$ git clone https://github.com/cmu-sei/cyobstract.git
If you do not have pandas, install it with pip install pandas
.
Move to the cyobstract directory and run python setup.py install
.
The dependencies future
and progress
should have been installed when you ran setup.py
, but if they were not, you can install them using these commands:
pip install future
pip install progress
If you are using the developer tools, there are several dependencies that are detailed in the Using the Cyobstract Developer Tools section below.
There are two functions in the cyobstract.extract
module. Both functions take an arbitrary string of text as input (for example the 'description' field from a given incident report).
The extract.extract_observables
function will extract all recognized types of artifacts from the input and return the results in a dictionary (type : values).
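from cyobstract import extract
# source of text; replace this sample with your own report text
text = "Report to incidents at cert dot org about www[.]cert[.]org"
results = extract.extract_observables(text)
print(results)  # for example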
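Because the result is a dictionary of type to values, it is easy to flatten for reporting or export. The sketch below uses a stand-in dictionary instead of calling Cyobstract, so the keys `type_a`, `type_b`, and `type_c` are placeholders, not the real type names. The helper `results_to_rows` is likewise hypothetical.

```python
def results_to_rows(results):
    """Flatten a {type: values} mapping into sorted (type, value) rows."""
    rows = []
    for observable_type in sorted(results):
        for value in sorted(results[observable_type]):
            rows.append((observable_type, value))
    return rows


# Stand-in for the dictionary returned by extract.extract_observables();
# the keys are placeholders rather than Cyobstract's actual type names.
sample_results = {
    "type_b": ["www.cert.org"],
    "type_a": ["192.0.2.10", "192.0.2.1"],
    "type_c": [],
}

for observable_type, value in results_to_rows(sample_results):
    print("%s\t%s" % (observable_type, value))
```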
You can also extract for a particular type of observable using extract.extract
:
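from cyobstract import extract
text = "..."  # source of text
observable_type = "..."  # name of the observable type to extract
for observable in extract.extract(text, observable_type):
    print(observable)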
This module contains a function that constructs optimized regular expressions based on a list of tokens:
from cyobstract import trie
tokens = ["dog", "dingo", "cat", "doggo"]  # list of tokens
re_str = trie.re_str_from_tokens(tokens)
This module is explored in more detail in Appendix B.
The developer tools are useful for developing and refining your own regular expressions and cataloging their results on your incident reports.
As with the basic installation, use either Python 2.7 or 3.6. Install the following:
pip install setuptools # at least version 28.05.0 (https://stackoverflow.com/a/40477249)
pip install pandas
pip install urllib
pip install progress
$ git clone https://github.com/cmu-sei/cyobstract.git
Configuration is specified in a YAML file located at ~/.cyobstract
. There are currently only three variables that can be defined. Here is an example template:
# repository path (where generated data lives). If undefined,
# it will default to the code repository root. Subdirectories
# are created here (e.g. 'dat', 'log', 'tmp'). If an 'etc'
# directory is here, it will be used, otherwise the 'etc'
# directory from the code repository is used.
repo_path:
# database URI
# e.g. /path/to/sqlite3_file
db_uri: '/database/uri'
# database driver as registered to the smoke.db module
# (custom drivers can be written to match your schema,
# see APPENDIX C)
db_driver: 'db1'
The regular expressions (and functions) used to extract indicators live in the cyobstract.extract
module. They consist of manually constructed regular expressions and optimized regexes that are automatically generated from collections of external data in the ./etc
directory. These regular expressions can be further refined and new indicator types and their corresponding regexes can be added to the analysis suite.
For more information on the tool suite that helps with developing new extractions as well as the components of the developer tools, see Appendix A. For details on cyobstract.trie
, the module used for building optimized regular expressions, see Appendix B.
You can read more about Cyobstract in some of our other publications:
-
FIRST Conference Presentation: Extracting Indicators from Incident Reports (June 2017)
-
SEI Blog Post: Improving Data Extraction from Cybersecurity Incident Reports (September 2017)
Please share your stories about using Cyobstract with us!
To learn more about Cyobstract, see the above publications or contact Sam Perl at the CERT Division of the Software Engineering Institute.
Cyobstract developer tools consist of a set of scripts and data. The tools include utilities for building new regexes, exploring the corpus of incident reports, and benchmarking the resulting iterations of extractions as they are refined.
The extractors can be compiled regexes or callable functions; in either case, they should return a list of the indicators found throughout the entire block of text provided to them. They consist of a combination of manually constructed regexes/functions, automatically generated regexes, and hybrids thereof. A compiled regex, as opposed to a callable, should have only one set of capturing parentheses since eventually .findall()
will be invoked on it.
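The reason for the single capturing group is how Python's `re` module shapes the output of `findall()`. The sketch below uses a simplified MD5-style pattern written for illustration, not one of Cyobstract's own expressions. With one group, `findall()` returns a flat list of strings. With two groups it returns tuples. Inner grouping can be kept non-capturing with `(?:...)`.

```python
import re

text = "dropper md5 d41d8cd98f00b204e9800998ecf8427e seen on host"

# One capturing group: findall() returns a flat list of strings.
one_group = re.compile(r"\b([0-9a-f]{32})\b")

# Two capturing groups: findall() returns a list of tuples instead.
two_groups = re.compile(r"\b(([0-9a-f]{16})[0-9a-f]{16})\b")

# Inner grouping made non-capturing, so the result is flat again.
non_capturing = re.compile(r"\b((?:[0-9a-f]{16}){2})\b")


def find_md5(block):
    """Callable alternative: return a list of matches for the whole block."""
    return one_group.findall(block)


print(one_group.findall(text))
print(two_groups.findall(text))
print(non_capturing.findall(text))
print(find_md5(text))
```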
The manually constructed regular expressions (and functions) reside in extract/regex.py
. Often, these expressions are built from primitives to improve readability.
The automatically generated regexes reside in extract/re_auto.py
, which in turn is imported into the main extract module. Since these regular expressions can be very large and complex, they are kept in a separate module to improve the readability of the main module.
Whether manual or automatic, the compiled regex or callable is stored in the regex.regexes
dictionary under the name of the type of indicator being extracted.
Automatically generated regular expressions are derived from lists of tokens that live in data files in the ./etc
directory. Some of these files are manually constructed and some are just raw lists of tokens downloaded from publicly available sources on the Internet.
Sometimes there are also lists of tokens to ignore since we have found some tokens generate too many false positives. The scripts that build the regexes from tokens live in the ./bin
directory.
Below are the current token files, how they are constructed, and the script that builds their respective regexes:
Generated File | Data Method | Build Script |
---|---|---|
country_codes.txt | downloaded | build_cc_re |
country_adjectivals.txt | downloaded | build_cc_re |
malware.txt | manual | build_malware_re |
file_exts.txt topic.txt | manual | build_exts_re |
html_entities.txt | downloaded | build_entity_re |
tlds.txt | downloaded | build_tld_re |
The scripts take no arguments and print the resulting regular expression to STDOUT. The regexes are grouped using non-capturing parentheses so, when they are eventually compiled, they will need to be embedded in capturing parentheses. Below is an example:
In extract/re_auto.py
my_raw_re = """
<pasted result from script output>
""".strip()
Then in extract/regex.py
import re
from . import re_auto
...
# this is just an example; you might not want word
# boundaries or you might want different regex options.
# You *do* want capturing parentheses.
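# The pattern is a raw string so that \b means "word boundary"
# rather than a backspace character.
my_re = re.compile(r"\b(%s)\b" % re_auto.my_raw_re, re.X)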
In all cases, the scripts use the cyobstract.trie
module to generate the optimized regexes. Details of how that module works can be found in Appendix B.
While refining and developing new regular expressions, it can be helpful to explore the collection of incident reports to help you design more precise (or permissive) future iterations. There are several scripts that facilitate this process. Once interesting new ideas or refinements are found, they can be integrated into the standard extraction library.
Note: A few of these scripts use the multiprocessing
module for parallel execution, which is why, if you write your own database driver (see Appendix C), it is important to track database connections by process ID.
Below is a list of these scripts and descriptions of what each one does. All scripts take -h
or --help
to access help that explains the details of their options.
This script allows the testing of arbitrary regular expressions across specific incident reports, or the entire corpus, typically in random order. Whenever a hit occurs, the incident ID, line number, and incident type are printed to STDOUT along with the line in question. Below is a typical usage for this script:
grep_entry -r "regex"
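The behavior described for this script can be approximated in a few lines. This is a sketch under stated assumptions, not the script's actual code: the function `grep_reports`, the in-memory `reports` dictionary, and the incident IDs are all hypothetical, and the incident type and random ordering are left out.

```python
import re


def grep_reports(pattern, reports):
    """Return (incident_id, line_no, line) for every line matching pattern."""
    regex = re.compile(pattern)
    hits = []
    for incident_id, text in reports.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((incident_id, line_no, line))
    return hits


# Hypothetical corpus; in practice the reports come from your database.
reports = {
    "INC-1": "User clicked a link\nCallback to 192.0.2.7 observed",
    "INC-2": "No indicators recorded",
}

for incident_id, line_no, line in grep_reports(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", reports):
    print("%s:%d: %s" % (incident_id, line_no, line))
```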
This script is similar to grep_entry
, except it uses the current extraction types that exist in cyobstract.extract
. It can operate on specific incident reports or the entire corpus. In addition, you can restrict the types of indicators that are used for the search. Below is a typical usage for this script:
grep_entry -r type_1 type_2
This script simply dumps the given incident reports to STDOUT. Below is a typical usage for this script:
cat_entry incident_id
The bin/bulk_query
script is the real workhorse of the testing suite. It runs every type of defined extraction across the entire corpus of incident reports. Results are saved and compared to prior runs. These results capture only hits, including true positives and false positives. It does not detect false negatives. When the script finishes running, some statistics are printed to STDOUT that detail what changed between the latest run and the prior run.
The complete results are stored in dat/extractions/<timestamp>
. For convenience, two soft links are created, previous
and current
, that point to their respective timestamped directories.
In each results directory, a text file is created with the type of indicator that it represents. Each of these files is made up of lines with the incident ID and the artifact that was extracted. It is useful to examine the differences between the current run and previous run using a tool such as diff
.
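When a plain diff is too noisy, the two runs can also be compared as sets. The sketch below assumes only what is stated above, that each line pairs an incident ID with an extracted artifact. The function `compare_runs` and the sample lines are hypothetical.

```python
def compare_runs(previous_lines, current_lines):
    """Return (gained, lost) hits between two runs of one indicator type."""
    previous = {line.strip() for line in previous_lines if line.strip()}
    current = {line.strip() for line in current_lines if line.strip()}
    return sorted(current - previous), sorted(previous - current)


# Hypothetical file contents for one indicator type.
previous_run = ["1001 www.cert.org", "1002 192.0.2.7"]
current_run = ["1001 www.cert.org", "1003 198.51.100.4"]

gained, lost = compare_runs(previous_run, current_run)
print("gained:", gained)
print("lost:", lost)
```

Hits that appear only in the current run are worth checking for new false positives. Hits that disappeared may be regressions or false positives that were removed.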
Unlike the bulk query tool, the smoke testing suite can specifically detect true positives, false positives, and false negatives. However, to achieve these results, some preparatory work is required. A collection of representative incident reports must be selected and manually parsed for artifacts. Once this is accomplished, the bin/smoke_test
script runs the extractions on each report, prints some statistics to STDOUT that represent the aggregate difference between the latest run and the prior run, and saves more detailed results in the dat/out
directory. Three files and three soft links are created:
<timestamp>.false_neg.txt
<timestamp>.false_pos.txt
<timestamp>.true_pos.txt
false_neg.txt -> <timestamp>.false_neg.txt
false_pos.txt -> <timestamp>.false_pos.txt
true_pos.txt -> <timestamp>.true_pos.txt
Representative incident reports must first be selected. These reports can be selected either via manual inspection or through exploratory tools such as bin/grep_entry
. Once selected, however, the following steps must be taken:
-
Run the command
bin/grab_entry <incident_id>
to save the text of the incident report in dat/iid/<incident_id>/notes.txt
.
-
(optional but helpful) Run the command
bin/draft_knowns <incident_id>
to create the file dat/iid/<incident_id>/knowns_draft.txt
. This file is a report of positive matches (both true and false) delineated by line number and the artifact that was extracted. It can be used as a starting point for the next step. After manual parsing, false positives should be deleted and false negatives should be added.
-
Manually parse the incident report and store artifacts that should be extracted in
dat/iid/<incident_id>/knowns.txt
. The conventional formatting is "line_no: artifact".
These three steps should be repeated as often as necessary to get a robust and representative sampling of incident reports. Once these are in place, improvements and regressions while developing the regular expressions can be precisely measured.
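The comparison behind the three result files can be sketched with set operations. This is not the code of bin/smoke_test: the functions `parse_knowns` and `score` are hypothetical, and the only detail taken from the text above is the "line_no: artifact" convention.

```python
def parse_knowns(lines):
    """Parse 'line_no: artifact' lines into a set of (line_no, artifact)."""
    knowns = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only; artifacts such as URLs or IPv6
        # addresses contain colons themselves.
        line_no, artifact = line.split(":", 1)
        knowns.add((int(line_no), artifact.strip()))
    return knowns


def score(knowns, extracted):
    """Classify extracted (line_no, artifact) pairs against the knowns."""
    return {
        "true_pos": knowns & extracted,
        "false_pos": extracted - knowns,
        "false_neg": knowns - extracted,
    }


knowns = parse_knowns(["3: http://www.cert.org/", "7: 192.0.2.7"])
extracted = {(3, "http://www.cert.org/"), (9, "readme.txt")}
for name, pairs in sorted(score(knowns, extracted).items()):
    print(name, sorted(pairs))
```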
The trie
module is used to construct optimized regular expressions from lists of tokens. It is used by the developer tools, but can also be used as a general purpose tool for building any regular expression.
First, print out the generated regular expression:
from cyobstract import trie
tokens = ["dog", "dingo", "cat", "doggo"]  # wherever the list of tokens comes from
re_str = trie.re_str_from_tokens(tokens)
print(re_str)
Then paste that result somewhere, such as my_regex.py
:
re_str = """
<pasted regex>
""".strip()
Then in the module where you use it:
import re
import my_regex
# need capturing parentheses in order to extract anything;
# the word boundary (\b) expressions are just an example
# of something that might surround the larger regex
my_re = re.compile(r"\b(%s)\b" % my_regex.re_str, re.U|re.X)
Let's say you have a list of words or phrases, which we'll call tokens. For example, here are four:
dog
dingo
cat
doggo
The naive way of constructing a regex to match those is something like this:
(?:dog|dingo|cat|doggo)
The optimizer, on the other hand, constructs a regex based on a prefix map, where shared prefixes are encoded a minimal number of times. So, the above becomes this:
(?:cat|(?:d(?:og(?:go)?|ingo)))
What this optimization ends up doing is minimizing the amount of backtracking the regex engine needs to do whenever a match fails. Note that (?:)
denotes a non-capturing group. Also, the "go" at the end of doggo is encoded as optional.
This kind of optimization can be done by hand if you're good with regexes, but where the tool really shines is when your original list comprises dozens or even hundreds of tokens.
Another helpful feature is that regex constructs can be embedded in the tokens. For example, with quick\s+brown\s+fox
the whitespace expressions \s+
are treated as a single atom and are preserved in the resulting optimized aggregate expression.
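The prefix-sharing idea can be shown with a small, simplified builder. This sketch is not the cyobstract.trie implementation, and its output may differ in ordering and grouping. The functions `build_trie` and `trie_to_regex` are hypothetical, and because every character is escaped, the sketch does not support regex constructs embedded in tokens.

```python
import re


def build_trie(tokens):
    """Nest tokens character by character; the key '' marks a token end."""
    root = {}
    for token in tokens:
        node = root
        for char in token:
            node = node.setdefault(char, {})
        node[""] = {}
    return root


def trie_to_regex(node):
    """Render a trie node as a regex fragment with shared prefixes merged."""
    ends_here = "" in node
    branches = [re.escape(char) + trie_to_regex(child)
                for char, child in sorted(node.items()) if char != ""]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    # If a token ends at this node, everything after it is optional.
    return body + "?" if ends_here else body


tokens = ["dog", "dingo", "cat", "doggo"]
pattern = "(?:" + trie_to_regex(build_trie(tokens)) + ")"
print(pattern)
```

As in the example above, the shared "d" and "dog" prefixes are each encoded once, and the trailing "go" becomes an optional group.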
Earlier in this file, we described how a database driver can be specified in the ~/.cyobstract
configuration file (e.g., db_driver: db1
). Cyobstract ships with one driver that uses a database schema: db1
.
smoke/db/db_driver_1.py
If this schema does not suit your needs, you can add your own. The easiest way to start is to examine the code in this file and modify it accordingly. There are three essential things to keep in mind:
-
The driver module must live in the
smoke/db
directory, and the filename must begin with db_driver.
-
The driver modules are responsible for managing their own database connection pool. The database URI can be specified in
~/.cyobstract
but could also be hard coded. Connections should be tracked by process ID.
-
The driver modules must register themselves with the main
smoke.db
module. Below is an example:
import sys
from smoke import db
...
db.register_driver('driver_name', sys.modules[__name__])
Once registered, 'driver_name' can be used in ~/.cyobstract
for the db_driver field.