blacklanternsecurity/bbot

Trufflehog accepting already parsed files


Describe the bug
While testing the enhancement to scan Postman workspaces, I noticed that pydantic was throwing an error (which is an issue in itself, but not the focus of this issue).

TRACE    bbot.core.event:logger.py:132 Traceback (most recent call last):
  File "/home/user/bbot/bbot/core/event/base.py", line 214, in __init__
    self.data = self._sanitize_data(data)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/bbot/bbot/core/event/base.py", line 642, in _sanitize_data
    data = self._data_validator(**data).model_dump(exclude_none=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.cache/pypoetry/virtualenvs/bbot-HOgd0vRk-py3.11/lib/python3.11/site-packages/pydantic/main.py", line 176, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for _data_validator
host
  Value error, Validation failed for ('',), {}: Invalid hostname: "" [type=value_error, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error

WARNING  bbot.modules.trufflehog:base.py:1347 Error sanitizing event data "{'severity': 'High', 'description': "Verified Secret Found. Detector Type: [URI] Decoder Type: [PLAIN] Details: [{'Data': {'Filesystem': {'file': '/tmp/.bbot_test/scans/testtrufflehog_nonverified_test_0t1aldutgx/git_repos/.bbot_test/test_keys/keys.txt', 'line': 1}}}] Raw result: [https://admin:admin@the-internet.herokuapp.com] RawV2 result: [https://admin:admin@the-internet.herokuapp.com/basic_auth]", 'host': ''}" for type "VULNERABILITY": 1 validation error for _data_validator
host
  Value error, Validation failed for ('',), {}: Invalid hostname: "" [type=value_error, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error
TRACE    bbot.modules.trufflehog:logger.py:132 Traceback (most recent call last):
  File "/home/user/bbot/bbot/core/event/base.py", line 214, in __init__
    self.data = self._sanitize_data(data)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/bbot/bbot/core/event/base.py", line 642, in _sanitize_data
    data = self._data_validator(**data).model_dump(exclude_none=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.cache/pypoetry/virtualenvs/bbot-HOgd0vRk-py3.11/lib/python3.11/site-packages/pydantic/main.py", line 176, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for _data_validator
host
  Value error, Validation failed for ('',), {}: Invalid hostname: "" [type=value_error, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/bbot/bbot/modules/base.py", line 435, in make_event
    event = self.scan.make_event(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/bbot/bbot/scanner/scanner.py", line 964, in make_event
    event = make_event(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/bbot/bbot/core/event/base.py", line 1655, in make_event
    return event_class(
           ^^^^^^^^^^^^
  File "/home/user/bbot/bbot/core/event/base.py", line 985, in __init__
    super().__init__(*args, **kwargs)
  File "/home/user/bbot/bbot/core/event/base.py", line 217, in __init__
    raise ValidationError(f'Error sanitizing event data "{data}" for type "{self.type}": {e}')
bbot.errors.ValidationError: Error sanitizing event data "{'severity': 'High', 'description': "Verified Secret Found. Detector Type: [URI] Decoder Type: [PLAIN] Details: [{'Data': {'Filesystem': {'file': '/tmp/.bbot_test/scans/testtrufflehog_nonverified_test_0t1aldutgx/git_repos/.bbot_test/test_keys/keys.txt', 'line': 1}}}] Raw result: [https://admin:admin@the-internet.herokuapp.com] RawV2 result: [https://admin:admin@the-internet.herokuapp.com/basic_auth]", 'host': ''}" for type "VULNERABILITY": 1 validation error for _data_validator
host
  Value error, Validation failed for ('',), {}: Invalid hostname: "" [type=value_error, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error

TRACE    bbot.core.event:logger.py:132 Traceback (most recent call last):
  File "/home/user/bbot/bbot/core/event/base.py", line 214, in __init__
    self.data = self._sanitize_data(data)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/bbot/bbot/core/event/base.py", line 642, in _sanitize_data
    data = self._data_validator(**data).model_dump(exclude_none=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.cache/pypoetry/virtualenvs/bbot-HOgd0vRk-py3.11/lib/python3.11/site-packages/pydantic/main.py", line 176, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for _data_validator
host
  Value error, Validation failed for ('',), {}: Invalid hostname: "" [type=value_error, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error

WARNING  bbot.modules.trufflehog:base.py:1347 Error sanitizing event data "{'description': "Potential Secret Found. Detector Type: [URI] Decoder Type: [PLAIN] Details: [{'Data': {'Filesystem': {'file': '/tmp/.bbot_test/scans/testtrufflehog_nonverified_test_0t1aldutgx/git_repos/.bbot_test/test_keys/keys.txt', 'line': 4}}}] Raw result: [https://admin:admin@internal.host.com] RawV2 result: [https://admin:admin@internal.host.com]", 'host': ''}" for type "FINDING": 1 validation error for _data_validator
host
  Value error, Validation failed for ('',), {}: Invalid hostname: "" [type=value_error, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error
TRACE    bbot.modules.trufflehog:logger.py:132 Traceback (most recent call last):
  File "/home/user/bbot/bbot/core/event/base.py", line 214, in __init__
    self.data = self._sanitize_data(data)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/bbot/bbot/core/event/base.py", line 642, in _sanitize_data
    data = self._data_validator(**data).model_dump(exclude_none=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.cache/pypoetry/virtualenvs/bbot-HOgd0vRk-py3.11/lib/python3.11/site-packages/pydantic/main.py", line 176, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for _data_validator
host
  Value error, Validation failed for ('',), {}: Invalid hostname: "" [type=value_error, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/bbot/bbot/modules/base.py", line 435, in make_event
    event = self.scan.make_event(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/bbot/bbot/scanner/scanner.py", line 964, in make_event
    event = make_event(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/bbot/bbot/core/event/base.py", line 1655, in make_event
    return event_class(
           ^^^^^^^^^^^^
  File "/home/user/bbot/bbot/core/event/base.py", line 985, in __init__
    super().__init__(*args, **kwargs)
  File "/home/user/bbot/bbot/core/event/base.py", line 217, in __init__
    raise ValidationError(f'Error sanitizing event data "{data}" for type "{self.type}": {e}')
bbot.errors.ValidationError: Error sanitizing event data "{'description': "Potential Secret Found. Detector Type: [URI] Decoder Type: [PLAIN] Details: [{'Data': {'Filesystem': {'file': '/tmp/.bbot_test/scans/testtrufflehog_nonverified_test_0t1aldutgx/git_repos/.bbot_test/test_keys/keys.txt', 'line': 4}}}] Raw result: [https://admin:admin@internal.host.com] RawV2 result: [https://admin:admin@internal.host.com]", 'host': ''}" for type "FINDING": 1 validation error for _data_validator
host
  Value error, Validation failed for ('',), {}: Invalid hostname: "" [type=value_error, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error

While tracing this error back, I noticed that trufflehog is accepting the file inside the already-processed git folder 😲.
There is a filter to prevent this, and we have a test that checks the filter is working (although the pydantic error stops the event from being emitted, so the tests pass).

Logs
A link to a recent test: https://github.com/blacklanternsecurity/bbot/actions/runs/11122673551/job/30904483667?pr=1811

Looking at the logs, the filter correctly identifies that it has already parsed the folder in the TestTrufflehog test case:

HUGEVERBOSE bbot.core.helpers.command:logger.py:132 run: /tmp/.bbot_test/tools/trufflehog --json --no-update --only-verified --concurrency=8 git file:///tmp/.bbot_test/scans/testtrufflehog_test_lajw5p4uku/git_repos/.bbot_test/test_keys
DEBUG    bbot.modules.trufflehog:base.py:1235 Got FILESYSTEM("{'path': '/tmp/.bbot_test/scans/testtrufflehog_test_lajw5p4uku/git_repos/.bbot_t...", module=unstructured, tags={'distance-1', 'parsed-folder', 'file'}) from unstructured
DEBUG    bbot.modules.trufflehog:base.py:1235 Not accepting FILESYSTEM("{'path': '/tmp/.bbot_test/scans/testtrufflehog_test_lajw5p4uku/git_repos/.bbot_t...", module=unstructured, tags={'distance-1', 'parsed-folder', 'file'}) because it did not meet custom filter criteria: Parent folder has already been processed

But in the TestTrufflehog_NonVerified test case, it bypasses the filter and ends up scanning both the file and the folder:

HUGEVERBOSE bbot.core.helpers.command:logger.py:132 run: /tmp/.bbot_test/tools/trufflehog --json --no-update --concurrency=8 filesystem /tmp/.bbot_test/scans/testtrufflehog_nonverified_test_0t1aldutgx/git_repos/.bbot_test/test_keys/keys.txt
HUGEVERBOSE bbot.core.helpers.command:logger.py:132 run: /tmp/.bbot_test/tools/trufflehog --json --no-update --concurrency=8 git file:///tmp/.bbot_test/scans/testtrufflehog_nonverified_test_0t1aldutgx/git_repos/.bbot_test/test_keys

My working theory is that, by chance, the trufflehog module consumes the file first and then the folder, an ordering the filter cannot prevent.
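To illustrate the theory, here is a minimal sketch of an order-dependent "parent folder already processed" filter. It is not bbot's actual code: `accepts`, `run`, and the shortened paths are hypothetical, and the real filter may track processed folders differently. It only shows why file-then-folder ordering would let both through.

```python
from pathlib import PurePosixPath


def accepts(path: str, processed: set[str]) -> bool:
    """Reject a path when one of its parent folders was already processed."""
    parents = {str(p) for p in PurePosixPath(path).parents}
    return not (parents & processed)


def run(order: list[str]) -> list[str]:
    """Feed paths through the filter in the given order; return what got scanned."""
    processed: set[str] = set()
    scanned = []
    for path in order:
        if accepts(path, processed):
            scanned.append(path)
            processed.add(path)
    return scanned


# Hypothetical, shortened stand-ins for the paths in the logs above
folder = "/scan/git_repos/test_keys"
file = "/scan/git_repos/test_keys/keys.txt"

folder_first = run([folder, file])  # the file is rejected: its parent was processed
file_first = run([file, folder])    # both are scanned: nothing was processed yet
```

With the folder first, only the folder is scanned. With the file first, the filter has nothing to compare against when the file arrives, and the folder is not a child of the file, so both get scanned.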

As a fix for trufflehog picking up already parsed files, I could filter out files with the tag parsed-folder, as this is what unstructured adds when it crawls a folder and re-raises the individual files as FILESYSTEM events.
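A rough sketch of what that tag-based filter could look like. `FakeEvent` is a hypothetical stand-in rather than bbot's event class, and the `(accepted, reason)` return shape is an assumption based on the "did not meet custom filter criteria: ..." message in the logs. The `git` tag on the folder event is made up for illustration; only `parsed-folder` comes from the logs above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeEvent:
    """Hypothetical stand-in for a FILESYSTEM event; not bbot's real class."""
    path: str
    tags: set = field(default_factory=set)


def filter_event(event: FakeEvent) -> tuple[bool, str]:
    """Return (accepted, reason). Rejects files re-raised from a crawled folder."""
    if "parsed-folder" in event.tags:
        return False, "File was re-raised from an already parsed folder"
    return True, ""


# Tags for the file copied from the log output above; the folder's tags are assumed
reraised_file = FakeEvent(
    "/scan/git_repos/test_keys/keys.txt",
    {"distance-1", "parsed-folder", "file"},
)
git_folder = FakeEvent("/scan/git_repos/test_keys", {"distance-1", "git"})

file_result = filter_event(reraised_file)
folder_result = filter_event(git_folder)
```

Because the check depends only on the event's own tags, it does not matter whether the file or the folder reaches the module first.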