blacklanternsecurity/bbot

BBOT v1.x vs v2.x SIEM Friendly JSON

Opened this issue · 11 comments

Describe the bug

The JSON output via the json and http output modules differs between BBOT v1.x and v2.x.

Even with siem_friendly=true set on both modules, there remains at least one fundamental type conflict, and one nested list-of-lists structure that SIEMs and related tools may struggle to consume or utilise.

Expected behavior

Issue 1 - .data.SCAN type conflict

Many SIEMs require type consistency between documents, e.g. a field that has been a string in previous documents cannot suddenly become an object in future ones.

With BBOT v1.x siem_friendly=True output, .data.SCAN is a string, e.g.

{
  "type": "SCAN",
  "id": "SCAN:725368977d3a680e579707504e59428a7e3acc9d",
  "data": {
    "SCAN": "heinous_hermione (SCAN:725368977d3a680e579707504e59428a7e3acc9d)"
  },
  "scope_distance": 0,
  "scan": "SCAN:725368977d3a680e579707504e59428a7e3acc9d",
  "timestamp": 1709170919.403808,
  "source": "SCAN:725368977d3a680e579707504e59428a7e3acc9d",
  "tags": [
    "in-scope"
  ],
  "module": "TARGET",
  "module_sequence": "TARGET"
}

Whereas with BBOT v2.x siem_friendly=True output, .data.SCAN is an object, e.g.

{
  "type": "SCAN",
  "id": "SCAN:b7b249df0e216908b4377509f50ac8092326b36b",
  "scope_description": "in-scope",
  "data": {
    "SCAN": {
      "id": "SCAN:b7b249df0e216908b4377509f50ac8092326b36b",
      "name": "devious_edna",
      "target": {
        "seeds": [
          "blacklanternsecurity.com"
        ],
        "whitelist": [
          "blacklanternsecurity.com"
        ],
        "blacklist": [],
        "strict_scope": false,
        "hash": "cffefd70a4eac5b8389a3c16987fb2ae91328c4c",
        "seed_hash": "29b7be19a3f7633571a48c40f320d465c918c26b",
        "whitelist_hash": "29b7be19a3f7633571a48c40f320d465c918c26b",
        "blacklist_hash": "da39a3ee5e6b4b0d3255bfef95601890afd80709",
        "scope_hash": "ef4a64d445f60a4ae47d81411d8994e40c3382d1"
      },

Issue 2 - List of List of Stuff

BBOT v2.x adds a new field, "discovery_path". With siem_friendly=True, it is output as a list of lists of strings.

This is technically valid JSON; however, some SIEMs and related tools will struggle to consume or interact with it, and utilising the data via SIEM search functionality may be difficult or impossible even where the data is accepted, e.g. because discovery_path is ignored and cannot be used in searches, or because there is no way to search against a specific position in the discovery path.

                "discovery_path": [
                    [
                        "DNS_NAME:1e57014aa7b0715bca68e4f597204fc4e1e851fc",
                        "Scan devious_edna seeded with DNS_NAME: blacklanternsecurity.com"
                    ],
                    [
                        "ORG_STUB:85519fabe82bd286159b7bfcf5c72139a563135b",
                        "speculated ORG_STUB: blacklanternsecurity"
                    ]
                ],

What would be ideal here is:

  1. siem_friendly=True JSON output structures should be consistent across versions, e.g. no type conflicts.
  2. Testing as part of the standard BBOT test suite, using JSON Schema (or a similar method), to verify that no structure or type conflicts are introduced in future (see the sketch after this list).
  3. No lists of lists in JSON output if at all possible. Lists of simple types such as strings are fine, and lists of objects are fine; where a list of lists might otherwise be used, an object-based structure would provide a wider range of compatibility, e.g. something like:
                "discovery_path": {
                    "0": [
                        "DNS_NAME:1e57014aa7b0715bca68e4f597204fc4e1e851fc",
                        "Scan devious_edna seeded with DNS_NAME: blacklanternsecurity.com"
                    ],
                    "1": [
                        "ORG_STUB:85519fabe82bd286159b7bfcf5c72139a563135b",
                        "speculated ORG_STUB: blacklanternsecurity"
                    ]
                },

Or simply separating lists as per #1670
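
On point 2, a minimal sketch of what such a schema test might look like, using the Python jsonschema library (illustrative only, not BBOT's actual test suite; it pins .data.SCAN to the v2.x object form):

# Illustrative only: a regression test along these lines could pin down the
# siem_friendly structure so that type conflicts (e.g. .data.SCAN flipping
# between string and object) fail CI.
import json

from jsonschema import validate  # pip install jsonschema

# Partial schema pinning .data.SCAN to the v2.x object form.
SCAN_EVENT_SCHEMA = {
    "type": "object",
    "required": ["type", "id", "data"],
    "properties": {
        "type": {"const": "SCAN"},
        "id": {"type": "string"},
        "data": {
            "type": "object",
            "required": ["SCAN"],
            # would fail validation if .data.SCAN became a string again
            "properties": {"SCAN": {"type": "object"}},
        },
    },
}

def check_ndjson_output(path: str) -> None:
    """Validate every SCAN event in a BBOT NDJSON output file."""
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") == "SCAN":
                validate(event, SCAN_EVENT_SCHEMA)  # raises ValidationError on conflict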

BBOT Command

Example: bbot -t elastic.co -p subdomain-enum -c output_modules.json.siem_friendly=true -om json

OS, BBOT Installation Method + Version

macOS, poetry install/shell, BBOT v2.0.0

BBOT Config

n/a

Logs

As above.

Screenshots

n/a

For 1 - FYI - you should be able to handle both a string and an object via an ingest pipeline for Elastic; I can't say the same for other SIEMs. I believe the initial dataset change consistently made the fields that were going both ways into objects. It helped before, and a similar change can help again. But read on if we need to use both object and string.

For an Elasticsearch pipeline, you can check the field for a string using something like:

    {
      "rename": {
        "field": "bbot.data.SCAN",
        "target_field": "bbot.data.SCAN_string",
        "if": "ctx.bbot?.data?.SCAN  instanceof String"
      }
    },

So in theory you could handle both use cases in the same pipeline, and then handle what you want to do with each form of the output. The more consistent the data is for string vs object, obviously the easier it is to avoid having to catch these one-offs, but I understand completely if needing both is justifiable.
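
For example, a companion processor for the object case might look like the following (illustrative only; SCAN_object is just a placeholder target field). In ingest pipeline conditions, JSON objects arrive as Maps, so instanceof Map catches the v2.x form:

    {
      "rename": {
        "field": "bbot.data.SCAN",
        "target_field": "bbot.data.SCAN_object",
        "if": "ctx.bbot?.data?.SCAN instanceof Map"
      }
    },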

  1. I like the suggestion there, and it would be great if that can be adopted; if not, I would be happy to look into it as well.

I forgot to note that this could ingest as-is; however, it's not clean, just a keyword. So it is searchable and can be aggregated, but the data within might not be useful that way.

I think one could use Painless on the data as it resides today to combine the entries into a human-readable form, while also keeping them as separate data points for other use cases such as finding the most common paths.

As for the human-readable form, I think we could combine the data and show something like:

DNS_NAME -> ORG_STUB -> X -> ...

And then doing the same with the data:

blacklanternsecurity.com -> blacklanternsecurity -> X -> ...

The information is there, which is amazing; we just need to align it as needed.
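
As a sketch of that idea in Python (the real implementation would live in a Painless script processor; values taken from the v2.x discovery_path example above):

# Flatten the v2.x list-of-lists discovery_path into a human-readable chain
# plus separate, individually searchable per-step descriptions.
discovery_path = [
    [
        "DNS_NAME:1e57014aa7b0715bca68e4f597204fc4e1e851fc",
        "Scan devious_edna seeded with DNS_NAME: blacklanternsecurity.com",
    ],
    [
        "ORG_STUB:85519fabe82bd286159b7bfcf5c72139a563135b",
        "speculated ORG_STUB: blacklanternsecurity",
    ],
]

# Take the event type from each step's "TYPE:hash" identifier.
type_chain = " -> ".join(step[0].split(":", 1)[0] for step in discovery_path)
# Keep each step's description as its own data point.
step_descriptions = [step[1] for step in discovery_path]

print(type_chain)        # DNS_NAME -> ORG_STUB
print(step_descriptions)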

Thanks Nic. Yes I'm aware of how to rename fields via ingest pipeline.

It would actually have to be the instances of .data.SCAN (aka .data.scan) that are objects which should be renamed.

e.g. what I was doing while trying to ensure compatibility was:

- rename:
    description: Fix type conflict between BBOT v1.x (string/keyword) and v2.x (object)
    field: bbot.data.SCAN
    target_field: bbot.data.scan_init
    ignore_missing: true

If we just rename string instances at .data.scan, but still allow an object to exist in .data.scan in other events, it can still cause type conflicts leading to ingest failures or search failures.

e.g. if someone starts sending BBOT v1.x and BBOT v2.x logs in simultaneously, indexing failures will occur depending on the index template that the active write index is using.

e.g. if an Elastic instance has older BBOT v1.x data in indices that had an index template where .data.scan is a string type, and they upgrade the integration and start ingesting BBOT v2.x data, there will be an index rollover event at install time. As such there will be a type conflict between the older indices where .data.scan is a string, and the new active write index where .data.scan is an object. Any new BBOT v1.x data sent in will fail to be indexed. BBOT v2.x data will be successfully indexed however. And this situation leads to, at best, partial search results.

Though to be fair, this really just means SCAN events are the only ones that fail to get indexed or returned as search results; everything else will be ingested fine and should come back in search results.

Are you actively ingesting BBOT output data into Elastic? Can you perhaps offer your opinion on whether guaranteed backwards compatibility is required? Would you realistically be ingesting BBOT v1.x and v2.x results at the same time?

I've got an integrations repo fork in progress here, https://github.com/routedlogic/integrations/tree/bbot-v2

This currently doesn't include support for BBOT v1.x data, and after some other back and forth with TheTechromancer it seems like not worrying about backwards compatibility might be the approach.

That said, I can quickly rework what I've done to support both by adding this sub-ingest pipeline back in to handle v2.x-specific data.

The issue with .data.discovery_path definitely isn't a show-stopper; simply marking it as dynamic: true allows elastic-agent test to complete. The way Elastic handles type detection for the field does collapse the list of lists down to a single plain list from a search perspective; the sequence context gets lost, but it's still usable for search and visualisations.

Good thoughts.

I was thinking you could keep it a string but use a completely different target for data.SCAN, something like SCAN_details, to keep backwards compatibility and avoid any conflicts, since 2.0 hasn't been used yet. But I like the thought of deprecating 1.x if it won't be supported any more.

We wouldn't ingest both versions; we plan on using the latest and not looking back.

We can also state in the integration that it only supports BBOT 2.x, and that anyone who wants to use an older version of BBOT needs version 0.2.0 of the integration.

We don't need backwards compatibility, given the amount of effort it seemingly requires.

Hope that helps.

Also, I like the fork you have been working on so far. Lots of good improvements.

Any updates here?

The changes are merged into dev, and are waiting to be merged into stable for version 2.1.0.

#1670

@nicpenning @CarsonHrusovsky @TheTechromancer

I've issued the PR to elastic/integrations to pull in my BBOT v2.x enhancements.

I've re-tested extensively with BBOT v2.1.2 today, no issues.

PR: elastic/integrations#11742
Fork for PR: https://github.com/colin-stubbs/integrations/tree/bbot-v2

This is AWESOME - excited to try this out (I guess it's time to upgrade to BBOT 2.0 😄) - great work Colin

Thanks for your work on this @colin-stubbs.

Recently, as I've been working on BBOT server, I've started to realize how big of an issue the .data type discrepancy is, and how nice it would be to have it consistent by default, without the need for siem_friendly=True.

This would probably mean separating out .data into two different attributes, e.g. .data_str and .data_json. I know we talked about this before.
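
As a rough sketch (the field names are just placeholders at this point), that might look something like:

{
  "type": "DNS_NAME",
  "data_str": "blacklanternsecurity.com"
}

{
  "type": "SCAN",
  "data_json": {
    "id": "SCAN:b7b249df0e216908b4377509f50ac8092326b36b",
    "name": "devious_edna"
  }
}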

Honestly, nesting like this:

{
  "type": "DNS_NAME",
  "data": {
    "DNS_NAME": "blacklanternsecurity.com"
  }
}

...feels kind of awkward, because in order to get to the data, you have to know what type of event it is. What do you guys think? Is changing that again going to cause a huge headache?

Kind of a problem, but we could release a new major Elastic integration version that explicitly drops all support for prior formats.

Or an entirely new integration which uses different indices and field prefix for bbot data.

An integration could actually be irrelevant if the BBOT server is using Elastic for storage and handling collection and delivery of all scan results etc. to it, assuming that at least scan/time-series data is stored in indices/data streams following the standard naming convention.

Okay, yeah my thought is to have an "event store", which could be any type of database including Postgres, Mongo, Elastic, etc. Each of these would have a dedicated output module that would format the data, build the correct indexes, etc.

Then BBOT server will monitor the event store (which is basically a "time machine" of events), and automatically aggregate/present the data in a nice way, alert on new vulns, etc.
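
Roughly, a dedicated output module for one of those backends might take a shape like this (a sketch only; the base class path, handle_event(), and the event.json() kwarg are assumptions based on BBOT's current module layout, and index_document() is a hypothetical backend-specific helper):

# Hypothetical sketch of a dedicated event-store output module.
from bbot.modules.output.base import BaseOutputModule

class EventStore(BaseOutputModule):
    async def setup(self):
        # connect to the backend and create the correct indexes/mappings here
        return True

    async def handle_event(self, event):
        # format the event for this backend and write it to the event store
        doc = event.json(siem_friendly=True)  # assumption: json() accepts this kwarg
        await self.index_document(doc)  # hypothetical write to the event store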

@colin-stubbs we have two options. We could either make the change right now and sneak it into your current elastic PR. Or we could save it for BBOT 3.0, which would be released alongside BBOT server. Either one is okay with me.