The Safedocs File Observatory is a tool for quickly filtering and visualizing large document datasets based on low-level metadata that has been parsed and stored. It gives the user a variety of dynamic visualizations for gaining insight into specific fields through basic value counts, as well as the ability to trigger more complex, multi-level Elasticsearch queries.
The search view is where the bulk of operations happen. It is divided into two sections: one for viewing the individual document results returned from searches and one dedicated to visualizing this data.
Searches are performed from the top search bar in either basic or advanced mode. In basic mode, the terms you enter in the search bar are looked for across all fields in a document and the best matches are returned. In advanced mode (triggered by clicking the angle brackets icon next to the search button), you can directly customize the query sent to Elasticsearch. The refresh icon next to the advanced mode button allows you to refresh a query if something isn't working as expected.
The default search query performed in basic mode:
{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "<your text here>",
          "type": "best_fields"
        }
      }
    }
  }
}
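As a minimal sketch, the default basic-mode body above could be produced by a small helper like the one below. The name buildBasicQuery and the BasicQuery interface are hypothetical and not taken from the application's source; only the JSON shape comes from the query shown above.

```typescript
// Shape of the default basic-mode request body shown above.
interface BasicQuery {
  query: {
    bool: {
      must: {
        query_string: { query: string; type: "best_fields" };
      };
    };
  };
}

// Hypothetical helper (an assumption, not the application's code):
// wraps the user's text in the default query_string search.
function buildBasicQuery(text: string): BasicQuery {
  return {
    query: {
      bool: {
        must: {
          query_string: { query: text, type: "best_fields" },
        },
      },
    },
  };
}

// Print the body that would be sent for the search text "test".
console.log(JSON.stringify(buildBasicQuery("test"), null, 2));
```

Because query_string parses its input, text such as field:value is interpreted as query syntax rather than as literal words.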
In advanced mode, you can enter any valid Elasticsearch query to search for documents. However, there are a few caveats to be aware of if you want data from an advanced query to show up in the Completion and Similarity tables below the Data Viz tab. These tables rely on an initial suggest query that contains suggesters with the specific names similarity-suggestion and completion, one for each table respectively. Once data from these queries is received, a backend process is automatically triggered that re-queries each completion result to get accurate counts. The easiest way to see and modify these underlying queries is to perform an initial search and then switch to advanced mode, which will provide you with a query similar to the following that you can then edit.
{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "test",
          "type": "best_fields"
        }
      }
    }
  },
  "suggest": {
    "similarity-suggestion": {
      "text": "test",
      "term": {
        "field": "q_parent_and_keys",
        "suggest_mode": "always",
        "sort": "frequency",
        "size": 100,
        "max_edits": 2,
        "min_word_length": 2,
        "max_term_freq": 2000000
      }
    },
    "completion": {
      "prefix": "test",
      "completion": {
        "field": "q_keys_and_values.completion",
        "size": 2000,
        "skip_duplicates": true
      }
    }
  },
  "size": 250,
  "from": 0,
  "sort": []
}
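When writing a custom advanced query, the two named suggesters have to be carried along by hand. The sketch below shows one way to attach them to an arbitrary request body. The helper name withSuggesters is hypothetical, and the default field names and suggester parameters are simply copied from the example above; they are assumptions about your index, not requirements of the tool.

```typescript
type Json = Record<string, unknown>;

// Hypothetical helper: returns a copy of `body` with the two suggesters
// that the Similarity and Completion tables look for by name.
function withSuggesters(
  body: Json,
  text: string,
  similarityField: string = "q_parent_and_keys", // copied from the example
  completionField: string = "q_keys_and_values.completion" // copied from the example
): Json {
  return {
    ...body,
    suggest: {
      // The name must be exactly "similarity-suggestion".
      "similarity-suggestion": {
        text,
        term: {
          field: similarityField,
          suggest_mode: "always",
          sort: "frequency",
          size: 100,
          max_edits: 2,
          min_word_length: 2,
          max_term_freq: 2000000,
        },
      },
      // The name must be exactly "completion".
      completion: {
        prefix: text,
        completion: { field: completionField, size: 2000, skip_duplicates: true },
      },
    },
  };
}

// Example: attach the suggesters to a custom match query.
const customBody = withSuggesters({ query: { match: { q_keys: "Font" } } }, "Font");
console.log(Object.keys(customBody)); // the original keys plus "suggest"
```

The input body is not mutated, so the same base query can be reused with different suggestion text.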
The other key way to search for documents is by using filters. Filters are available in a collapsed menu below the search bar and are meant to be stacked on top of basic search queries (filtering through the UI is not available in advanced search mode). The number to the right of a filter/field name represents the unique count of terms for a specific field returned along with your basic search query. This count is limited to the top 1000 unique terms by default, but if you expand a filter, you have the option to begin typing to further narrow down which terms you are looking for, which will perform subsequent queries to pull in more results.
Results returned from a search query will automatically be populated in the table below the search box. The number in the top left of the table will show the total count of results returned from a query, but since queries can be quite large, the table is virtualized and will by default only be populated with up to 250 documents initially. As you scroll through the table, subsequent queries will be performed to increase this number. All visible fields will be shown as columns in the table, but if you want to see all fields associated with a specific document, you can click the arrow next to the checkbox on the left side of a row.
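The incremental loading described above can be pictured as offset arithmetic on the size and from parameters of the search request. The following is only a sketch under that assumption; the function name nextPage is hypothetical, and the application's actual paging logic may differ.

```typescript
const PAGE_SIZE = 250; // initial page size described above

interface PageParams {
  from: number; // offset of the first hit to fetch
  size: number; // number of hits to fetch
}

// Hypothetical helper: returns the from/size pair for the next request,
// or null once every matching document has been loaded.
function nextPage(loaded: number, total: number, pageSize: number = PAGE_SIZE): PageParams | null {
  if (loaded >= total) return null;
  return { from: loaded, size: Math.min(pageSize, total - loaded) };
}

// With 600 matching documents and the first 250 already in the table,
// the next request would start at offset 250.
console.log(nextPage(250, 600));
```

Note that Elasticsearch rejects requests where from + size exceeds the index's max_result_window setting (10,000 by default), so this kind of offset paging only reaches that far into a result set.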
Visible columns in the search table can be toggled in three ways. The first is by clicking the "eye" icon in the table header and manually selecting or unselecting the columns you want to see, which takes effect immediately. This can be quite tedious if your goal is to unselect all columns or rapidly select a few very specific ones. For this purpose, you can also make the selection from either the Mapping tab, which allows you to go through every Elasticsearch mapping field and control whether it is visible, or from the bottom of the Settings tab, which mirrors the data in the other two locations but allows for editing in a list format.
In addition to editing the visibility of columns, you can also rearrange their order. As with visibility, this can be done in three places. From the search table, you can drag and drop an individual column for immediate updates. From the Mapping tab, you can drag fields into a new order while editing all mapping field properties. Lastly, from the bottom of the Settings tab, you can drag and drop tags in a list arrangement.
The last type of modification you can do to columns in the search table is defining the sort order in which results are returned. This is done by hovering over a column name in the search table and clicking on the up or down arrow that appears next to it. Since the table is virtualized and not all results are immediately shown, changing the sort order requires another Elasticsearch query to be performed, so this might take a second.
On the top right of the search table, you will see an option for downloading either your current selection of documents or a variety of different sample sizes of documents from your query. For this feature to work, however, there must be a clear mapping between search results and actual document files. This can be configured in the Settings tab under Download Settings. From this section, you can configure which field corresponds to the path of a document, and you can either have a request made to a remote API or select a local or mounted root folder containing the respective documents.
The data visualization section is the right half of the Search tab. This section contains a sub-tab for dynamic count visualizations (Data Viz), a visualization of significant terms for a selected field (Sig Terms), a Trail of Bits Polyfile hex editor integration (Hex Editor), a crash analytics visualization (Crash Viz), a geospatial coordinate plotting visualization (Geospatial), and lastly a direct Kibana integration.
The Data Viz tab contains a dynamic visualization that breaks down the counts of the top terms for a selected field in your given query. In the top left of this tab, you can select which field you want to visualize and which type of visualization you want to use. Currently there is support for Donut, Bar Chart, and Treemap visualizations. In the Donut visualization, clicking on a specific section toggles the visibility of the field label (which is useful for this visualization specifically so that labels don't overlap). Below the Data Viz visualization section, there are three tables: a Count table, a Similar Tokens table, and a Completion table.
The Count table, unlike the above visualizations, includes all of the returned terms along with their counts rather than just the top ones. Additionally, clicking on the name of a term will toggle its visibility in the above visualizations.
The Similarity table contains the results of the similarity suggestion query, which is automatically performed alongside a basic search and is performed along with an advanced search if a suggest query named similarity-suggestion is included (see Advanced Search Query). The dropdown above this table allows you to configure which field to use for this query.
The Completion table, like the Similarity table, contains the results of the completion suggestion query, which is automatically performed alongside a basic search and along with an advanced search if a suggest query named completion is included (again see Advanced Search Query). Each result in this table has been re-queried to provide accurate counts. As with the Similarity table, you can configure which field to use for the query using the dropdown above the table. However, since this query specifically requires a completion analyzer to be specified in the Elasticsearch mapping, field options are limited to those that have this analyzer defined.
The Significant Terms tab contains a specialized visualization for a Significant Terms query. This query first requires a field to be selected from the dropdown below the tab name and will then execute the following Elasticsearch query to determine which terms for that field are the most significant using chi-square:
{
  "query": {
    "query_string": {
      "query": "q_keys:/.FontDescriptor/" // Example Query
    }
  },
  "size": 0,
  "aggregations": {
    "my_sample": {
      "sampler": {
        "shard_size": 10000
      },
      "aggregations": {
        "keywords": {
          "significant_terms": {
            "field": "tk_creator_tool.keyword", // Field selected: tk_creator_tool
            "chi_square": {
              "background_is_superset": false
            }
          }
        }
      }
    }
  }
}
Results for this query are returned in a Treemap visualization. Make sure this tab is open before you search to ensure that the visualization will be shown.
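To experiment with the Significant Terms query outside the tab, for example in advanced mode or in Kibana, a body of the same shape can be generated as sketched below. The helper name buildSigTermsQuery is hypothetical, and the sketch assumes, as in the example above, that the selected field has a .keyword sub-field in the index mapping.

```typescript
// Hypothetical helper (not taken from the application's source).
// Assumption: `field` has a `.keyword` sub-field, as in the example above.
function buildSigTermsQuery(queryString: string, field: string, shardSize: number = 10000) {
  return {
    query: { query_string: { query: queryString } },
    size: 0, // only the aggregation result is needed, not the hits
    aggregations: {
      my_sample: {
        // Restrict the analysis to the top-scoring documents on each shard.
        sampler: { shard_size: shardSize },
        aggregations: {
          keywords: {
            significant_terms: {
              field: `${field}.keyword`,
              chi_square: { background_is_superset: false },
            },
          },
        },
      },
    },
  };
}

console.log(JSON.stringify(buildSigTermsQuery("q_keys:/.FontDescriptor/", "tk_creator_tool"), null, 2));
```

Lowering shardSize trades accuracy for speed, since fewer documents per shard feed the significance calculation.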
The Hex Editor tab contains a graphical interface for the Trail of Bits Polyfile tool. You can use it either by selecting a document from your computer or, if you have the download settings configured (see Downloading), by selecting documents in the search table and choosing them from the Select Document dropdown. Once a document is selected, make sure an output directory is also specified and then click Generate. In the backend, this spins up a Docker container that runs Polyfile on your document and generates an HTML hex editor for it as output. This may take a few minutes to complete, but once it does, the Generate button is replaced with an Open button that pops the Hex Editor out in a new window. You can open previously generated Hex Editor HTML files (those in the output folder you specified) through the bottom dropdown.
The Crash Visualization is another specialized type of query visualization. This tab allows you to select a Crash Field that contains a space-delimited list of creator tool statuses (by default in the hosted dataset this is tools_status) that looks like the following:
"c_crash cd_crash cpu_success mc_success pb_success pc_crash pid_success pinfo_success pr_success q_success tk_success xpf_warn"
Once a Crash Field is selected, the next field to configure is the Creator Tool Field. This is a field that contains the name of the tool that created the respective document. If this field is empty for a specific document, it defaults to undefined. The last field to configure is the specific Creator Tool Name you want to visualize. This dropdown lists the unique terms from the selected Creator Tool Field. Once it is selected, a bar chart is shown of the percentage of crashes for each tool.
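The status string format shown above lends itself to simple parsing. The sketch below assumes that each token has the form prefix_status with the status after the last underscore, and that a crash percentage is the share of documents whose status for a given prefix is crash. Both function names are hypothetical, and the application's own aggregation may be computed differently (for example inside Elasticsearch). The input rows are assumed to be already filtered to a single Creator Tool Name.

```typescript
type ToolStatus = Record<string, string>;

// Splits "c_crash cpu_success ..." into { c: "crash", cpu: "success", ... }.
// Assumption: the status is whatever follows the last underscore.
function parseToolStatus(raw: string): ToolStatus {
  const result: ToolStatus = {};
  for (const token of raw.trim().split(/\s+/)) {
    const idx = token.lastIndexOf("_");
    if (idx <= 0) continue; // skips empty or malformed tokens
    result[token.slice(0, idx)] = token.slice(idx + 1);
  }
  return result;
}

// Computes, per prefix, the percentage of rows reporting "crash".
function crashPercentages(rows: string[]): Record<string, number> {
  const totals: Record<string, number> = {};
  const crashes: Record<string, number> = {};
  for (const row of rows) {
    const statuses = parseToolStatus(row);
    for (const tool of Object.keys(statuses)) {
      totals[tool] = (totals[tool] || 0) + 1;
      if (statuses[tool] === "crash") crashes[tool] = (crashes[tool] || 0) + 1;
    }
  }
  const percentages: Record<string, number> = {};
  for (const tool of Object.keys(totals)) {
    const total = totals[tool] || 0;
    percentages[tool] = total === 0 ? 0 : (100 * (crashes[tool] || 0)) / total;
  }
  return percentages;
}

// Two made-up status rows: "c" crashes in one of two documents, "tk" in none.
console.log(crashPercentages(["c_crash tk_success", "c_success tk_success"]));
```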
The Geospatial visualization tab, as its name suggests, allows you to visualize latitude/longitude coordinates associated with a document. In the hosted dataset, these are not necessarily the origins of documents, but rather the locations from which they were scraped/hosted. By default, the map only shows the documents that are currently loaded in the search table (250 if you haven't scrolled through the table). This can be changed using the Number of Points dropdown. If All is selected, a specialized query is made to search the entire dataset and cluster data together based on your zoom level. The Zoom Precision dropdown updates automatically based on how far you are zoomed into a specific area on the map. However, if you want to increase precision and reduce clustering, you can manually increase this number.
The Kibana tab, as its name implies, includes an integrated Kibana interface for viewing Elasticsearch data. After clicking on the tab name, you can click on it again to make it pop out in a separate window. If this tab is empty, it may mean that Kibana needs to be configured in the Settings tab.
The mapping view provides a graphical way of searching through the Elasticsearch mapping for the configured Elasticsearch index. This view shows you what type, analyzer, and other properties each field has associated with it. Additionally, it allows you to customize whether a specific field is visualizable (whether it shows as an option in the dropdowns for different visualizations), filterable (whether it shows in the list of Filters), or visible (whether the column is visible in the table view). On the top right side of each field is a hamburger button that allows you to drag and rearrange the order of fields as well. This updates the order in which the fields appear in the search table.
The settings view provides a single interface for configuring everything about the application. Many of the properties shown on this page are also configurable elsewhere, but are provided here for convenience. At the very top of the settings page are the Export Config and Import Config icon buttons. These allow you to export or import all of your configured settings and share them with others.
The first section in the Settings page pertains to configuring your connection to Elasticsearch. There are two modes of connection: using a passthrough API or directly specifying the address of Elasticsearch. The hosted dataset is provided through a passthrough API at https://api.safedocs.xyz/v1/elasticsearch/{INDEX} (where {INDEX} is a special string that is automatically replaced with the index, which is configured in the field below). Once the root Elasticsearch connection is configured, you can enter the index you wish to connect to in the field below. The ElasticSearch Index serves as a unique key with which all other settings are associated. This means you can have different settings for different indices and click on the autocomplete textbox to switch between them. Once an index is specified, you have all the information necessary to use the tool in its most basic capacity. The indicator button to the left of the Index field verifies that the specified index and configured details are valid; clicking it runs a basic query as a check. The Refresh button to the right of the Index field refreshes the index mapping. If fields aren't showing up as expected below, click this button to make sure the latest mapping is pulled.
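The {INDEX} placeholder substitution can be illustrated with a short sketch. The function name resolveEndpoint and the index name my-index are hypothetical, and how the application performs the replacement internally is an assumption here; only the placeholder string and the passthrough URL come from the description above.

```typescript
const INDEX_PLACEHOLDER = "{INDEX}";

// Hypothetical helper: substitutes every occurrence of the placeholder
// with the configured index name. A template without the placeholder
// (for example a direct Elasticsearch address) is returned unchanged.
function resolveEndpoint(template: string, index: string): string {
  return template.split(INDEX_PLACEHOLDER).join(index);
}

// "my-index" is a made-up index name used only for illustration.
console.log(resolveEndpoint("https://api.safedocs.xyz/v1/elasticsearch/{INDEX}", "my-index"));
```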
This section contains the settings for associating a document's path with the actual document to allow for downloading and use with tools such as the integrated Hex Editor. If Use Download API? is toggled on, specify an API endpoint that will return the downloaded files. File paths are sent to the API via an added query string like so:
https://api.safedocs.xyz/v1/files?paths=file1,file2,file3,...
If Use Download API? is toggled off, choose the Raw File Location, which serves as the root folder under which all files are stored. Currently, this is the preferred method of connection as it works the fastest and most reliably. Lastly in this section, configure the Download Path Field. This is the field that is associated with each document and contains the path to said document.
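For the API mode, the request URL format shown above can be assembled as in the following sketch. The function name buildDownloadUrl is hypothetical. Percent-encoding each path is an assumption made here so that commas or spaces inside a path do not break the comma-separated list; whether a given download API expects encoded paths should be checked against that API.

```typescript
// Hypothetical helper: builds "<endpoint>?paths=a,b,c" from the values of
// the configured Download Path Field for the selected documents.
function buildDownloadUrl(endpoint: string, paths: string[]): string {
  // Assumption: each path is percent-encoded individually and the
  // separating commas are left as-is.
  const encoded = paths.map((p) => encodeURIComponent(p)).join(",");
  return `${endpoint}?paths=${encoded}`;
}

console.log(buildDownloadUrl("https://api.safedocs.xyz/v1/files", ["file1", "file2", "file3"]));
```

With many selected documents the URL can grow long, so a real client might need to split the request into batches.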
These settings are also available for configuration from the Search tab and allow for field-based configuration of each special query.
This section is essentially a transpose of the Mapping tab and provides all of the same configuration options, just in three succinct and editable list views without the associated metadata for each mapping field.
This application was built using TypeScript, React, and Electron.js. To build locally, first clone the repository and then run the following from the root folder:
$ yarn install
To start the development version of the application after it has been installed, run:
$ yarn electron:start
To make a local build of the application, run the following:
$ yarn react-build && yarn electron-build
This will output compiled Windows, macOS, and Linux builds of the application in the dist folder.
If directly contributing to the repo, you can also make a release by running:
$ yarn build
This will prompt you to enter the new version number, which will automatically be added as a git tag, and then will proceed to generate a local build.
- Ryan Stonebraker, NASA JPL
- Mike Milano, NASA JPL
- Anastasija Mensikova, NASA JPL