Trufflehog accepting already parsed files
Describe the bug
While testing the enhancement to scan Postman workspaces, I noticed that pydantic was throwing a validation error (an issue in itself, but not the focus of this one):
TRACE bbot.core.event:logger.py:132 Traceback (most recent call last):
File "/home/user/bbot/bbot/core/event/base.py", line 214, in __init__
self.data = self._sanitize_data(data)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/bbot/bbot/core/event/base.py", line 642, in _sanitize_data
data = self._data_validator(**data).model_dump(exclude_none=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.cache/pypoetry/virtualenvs/bbot-HOgd0vRk-py3.11/lib/python3.11/site-packages/pydantic/main.py", line 176, in __init__
self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for _data_validator
host
Value error, Validation failed for ('',), {}: Invalid hostname: "" [type=value_error, input_value='', input_type=str]
For further information visit https://errors.pydantic.dev/2.7/v/value_error
WARNING bbot.modules.trufflehog:base.py:1347 Error sanitizing event data "{'severity': 'High', 'description': "Verified Secret Found. Detector Type: [URI] Decoder Type: [PLAIN] Details: [{'Data': {'Filesystem': {'file': '/tmp/.bbot_test/scans/testtrufflehog_nonverified_test_0t1aldutgx/git_repos/.bbot_test/test_keys/keys.txt', 'line': 1}}}] Raw result: [https://admin:admin@the-internet.herokuapp.com] RawV2 result: [https://admin:admin@the-internet.herokuapp.com/basic_auth]", 'host': ''}" for type "VULNERABILITY": 1 validation error for _data_validator
host
Value error, Validation failed for ('',), {}: Invalid hostname: "" [type=value_error, input_value='', input_type=str]
For further information visit https://errors.pydantic.dev/2.7/v/value_error
TRACE bbot.modules.trufflehog:logger.py:132 Traceback (most recent call last):
File "/home/user/bbot/bbot/core/event/base.py", line 214, in __init__
self.data = self._sanitize_data(data)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/bbot/bbot/core/event/base.py", line 642, in _sanitize_data
data = self._data_validator(**data).model_dump(exclude_none=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.cache/pypoetry/virtualenvs/bbot-HOgd0vRk-py3.11/lib/python3.11/site-packages/pydantic/main.py", line 176, in __init__
self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for _data_validator
host
Value error, Validation failed for ('',), {}: Invalid hostname: "" [type=value_error, input_value='', input_type=str]
For further information visit https://errors.pydantic.dev/2.7/v/value_error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/bbot/bbot/modules/base.py", line 435, in make_event
event = self.scan.make_event(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/bbot/bbot/scanner/scanner.py", line 964, in make_event
event = make_event(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/bbot/bbot/core/event/base.py", line 1655, in make_event
return event_class(
^^^^^^^^^^^^
File "/home/user/bbot/bbot/core/event/base.py", line 985, in __init__
super().__init__(*args, **kwargs)
File "/home/user/bbot/bbot/core/event/base.py", line 217, in __init__
raise ValidationError(f'Error sanitizing event data "{data}" for type "{self.type}": {e}')
bbot.errors.ValidationError: Error sanitizing event data "{'severity': 'High', 'description': "Verified Secret Found. Detector Type: [URI] Decoder Type: [PLAIN] Details: [{'Data': {'Filesystem': {'file': '/tmp/.bbot_test/scans/testtrufflehog_nonverified_test_0t1aldutgx/git_repos/.bbot_test/test_keys/keys.txt', 'line': 1}}}] Raw result: [https://admin:admin@the-internet.herokuapp.com] RawV2 result: [https://admin:admin@the-internet.herokuapp.com/basic_auth]", 'host': ''}" for type "VULNERABILITY": 1 validation error for _data_validator
host
Value error, Validation failed for ('',), {}: Invalid hostname: "" [type=value_error, input_value='', input_type=str]
For further information visit https://errors.pydantic.dev/2.7/v/value_error
TRACE bbot.core.event:logger.py:132 Traceback (most recent call last):
File "/home/user/bbot/bbot/core/event/base.py", line 214, in __init__
self.data = self._sanitize_data(data)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/bbot/bbot/core/event/base.py", line 642, in _sanitize_data
data = self._data_validator(**data).model_dump(exclude_none=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.cache/pypoetry/virtualenvs/bbot-HOgd0vRk-py3.11/lib/python3.11/site-packages/pydantic/main.py", line 176, in __init__
self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for _data_validator
host
Value error, Validation failed for ('',), {}: Invalid hostname: "" [type=value_error, input_value='', input_type=str]
For further information visit https://errors.pydantic.dev/2.7/v/value_error
WARNING bbot.modules.trufflehog:base.py:1347 Error sanitizing event data "{'description': "Potential Secret Found. Detector Type: [URI] Decoder Type: [PLAIN] Details: [{'Data': {'Filesystem': {'file': '/tmp/.bbot_test/scans/testtrufflehog_nonverified_test_0t1aldutgx/git_repos/.bbot_test/test_keys/keys.txt', 'line': 4}}}] Raw result: [https://admin:admin@internal.host.com] RawV2 result: [https://admin:admin@internal.host.com]", 'host': ''}" for type "FINDING": 1 validation error for _data_validator
host
Value error, Validation failed for ('',), {}: Invalid hostname: "" [type=value_error, input_value='', input_type=str]
For further information visit https://errors.pydantic.dev/2.7/v/value_error
TRACE bbot.modules.trufflehog:logger.py:132 Traceback (most recent call last):
File "/home/user/bbot/bbot/core/event/base.py", line 214, in __init__
self.data = self._sanitize_data(data)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/bbot/bbot/core/event/base.py", line 642, in _sanitize_data
data = self._data_validator(**data).model_dump(exclude_none=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.cache/pypoetry/virtualenvs/bbot-HOgd0vRk-py3.11/lib/python3.11/site-packages/pydantic/main.py", line 176, in __init__
self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for _data_validator
host
Value error, Validation failed for ('',), {}: Invalid hostname: "" [type=value_error, input_value='', input_type=str]
For further information visit https://errors.pydantic.dev/2.7/v/value_error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/bbot/bbot/modules/base.py", line 435, in make_event
event = self.scan.make_event(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/bbot/bbot/scanner/scanner.py", line 964, in make_event
event = make_event(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/bbot/bbot/core/event/base.py", line 1655, in make_event
return event_class(
^^^^^^^^^^^^
File "/home/user/bbot/bbot/core/event/base.py", line 985, in __init__
super().__init__(*args, **kwargs)
File "/home/user/bbot/bbot/core/event/base.py", line 217, in __init__
raise ValidationError(f'Error sanitizing event data "{data}" for type "{self.type}": {e}')
bbot.errors.ValidationError: Error sanitizing event data "{'description': "Potential Secret Found. Detector Type: [URI] Decoder Type: [PLAIN] Details: [{'Data': {'Filesystem': {'file': '/tmp/.bbot_test/scans/testtrufflehog_nonverified_test_0t1aldutgx/git_repos/.bbot_test/test_keys/keys.txt', 'line': 4}}}] Raw result: [https://admin:admin@internal.host.com] RawV2 result: [https://admin:admin@internal.host.com]", 'host': ''}" for type "FINDING": 1 validation error for _data_validator
host
Value error, Validation failed for ('',), {}: Invalid hostname: "" [type=value_error, input_value='', input_type=str]
For further information visit https://errors.pydantic.dev/2.7/v/value_error
While tracing this error back, I noticed that trufflehog is accepting a file inside a git folder that has already been processed 😲.
There is a filter to prevent this, and we have a test to check that the filter works. However, the pydantic error stops the event from being emitted, so the tests pass anyway.
Logs
A link to a recent test: https://github.com/blacklanternsecurity/bbot/actions/runs/11122673551/job/30904483667?pr=1811
Looking at the logs, the filter correctly identifies that the folder has already been parsed in the TestTrufflehog test case:
HUGEVERBOSE bbot.core.helpers.command:logger.py:132 run: /tmp/.bbot_test/tools/trufflehog --json --no-update --only-verified --concurrency=8 git file:///tmp/.bbot_test/scans/testtrufflehog_test_lajw5p4uku/git_repos/.bbot_test/test_keys
DEBUG bbot.modules.trufflehog:base.py:1235 Got FILESYSTEM("{'path': '/tmp/.bbot_test/scans/testtrufflehog_test_lajw5p4uku/git_repos/.bbot_t...", module=unstructured, tags={'distance-1', 'parsed-folder', 'file'}) from unstructured
DEBUG bbot.modules.trufflehog:base.py:1235 Not accepting FILESYSTEM("{'path': '/tmp/.bbot_test/scans/testtrufflehog_test_lajw5p4uku/git_repos/.bbot_t...", module=unstructured, tags={'distance-1', 'parsed-folder', 'file'}) because it did not meet custom filter criteria: Parent folder has already been processed
But in the TestTrufflehog_NonVerified test case, the file bypasses the filter, and both the file and the folder end up being scanned:
HUGEVERBOSE bbot.core.helpers.command:logger.py:132 run: /tmp/.bbot_test/tools/trufflehog --json --no-update --concurrency=8 filesystem /tmp/.bbot_test/scans/testtrufflehog_nonverified_test_0t1aldutgx/git_repos/.bbot_test/test_keys/keys.txt
HUGEVERBOSE bbot.core.helpers.command:logger.py:132 run: /tmp/.bbot_test/tools/trufflehog --json --no-update --concurrency=8 git file:///tmp/.bbot_test/scans/testtrufflehog_nonverified_test_0t1aldutgx/git_repos/.bbot_test/test_keys
My working theory is that, by chance, the trufflehog module consumes the file first and then the folder. The filter cannot prevent that ordering, presumably because the parent folder has not yet been marked as processed when the file arrives.
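To illustrate why the ordering would matter, here is a minimal sketch of a "parent folder already processed" check. It is not BBOT's actual filter: the class ProcessedFolderFilter, its accept method, and the example paths are all hypothetical, and it assumes the filter only remembers folders it has already seen.

```python
from pathlib import PurePosixPath


class ProcessedFolderFilter:
    """Hypothetical stand-in for a 'parent folder already processed' check."""

    def __init__(self):
        self.processed = set()  # folders that have already been accepted

    def accept(self, path, is_folder=False):
        p = PurePosixPath(path)
        # Reject anything that lives under an already processed folder.
        if any(parent in self.processed for parent in p.parents):
            return False
        if is_folder:
            self.processed.add(p)
        return True


# Illustrative paths only, not taken from a real scan.
FOLDER = "/scans/example/git_repos/test_keys"
FILE = FOLDER + "/keys.txt"

# Folder first: the file is filtered out.
f = ProcessedFolderFilter()
folder_first = [f.accept(FOLDER, is_folder=True), f.accept(FILE)]

# File first: nothing has been processed yet, so both are accepted.
f = ProcessedFolderFilter()
file_first = [f.accept(FILE), f.accept(FOLDER, is_folder=True)]

print(folder_first)  # [True, False]
print(file_first)    # [True, True]
```

If the real filter works along these lines, the result depends entirely on which event arrives first, which would match the two runs in the logs above.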
As a fix for trufflehog picking up already parsed files, I could filter out files with the tag parsed-folder, since this is what unstructured adds when it crawls a folder and re-raises the individual files as FILESYSTEM events.
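A rough sketch of what that tag-based check could look like is below. It is not the module's real code: the function name accept_for_scan and the tag set for the cloned folder are hypothetical, and it assumes the tag is spelled parsed-folder, as it appears in the logs above, and that only files re-raised by unstructured carry it.

```python
def accept_for_scan(event_type, tags):
    """Hypothetical filter: skip files that unstructured re-raised from a folder."""
    if event_type != "FILESYSTEM":
        return False
    # Tag spelling taken from the logs; assumed to mark re-raised files only.
    return "parsed-folder" not in tags


# Tags as shown in the logs for a file re-raised by unstructured.
reraised_file_tags = {"distance-1", "parsed-folder", "file"}
# Hypothetical tags for the cloned folder itself.
cloned_folder_tags = {"distance-1", "git", "folder"}

print(accept_for_scan("FILESYSTEM", reraised_file_tags))  # False
print(accept_for_scan("FILESYSTEM", cloned_folder_tags))  # True
```

Unlike a check that depends on which folders have been seen so far, this decision uses only the event's own tags, so it should not be affected by the order in which the file and the folder arrive.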