Dashlane/dbt-invoke

Windows emits Console Virtual Terminal Sequences which causes output parsing to fail

Closed this issue · 5 comments

Any Windows pros around?
I am not entirely sure why but running dbt commands can add characters that look like Console Virtual Terminal Sequences (https://docs.microsoft.com/en-us/windows/console/console-virtual-terminal-sequences) which will cause the parsing of the command output to fail.
stdout will then look something like

{"resource_type": "model", "depends_on": {"macros": ... }}
�[0m

Happened on a Windows 10.

Recreate:
This script caused the error:

from dbt_invoke.properties import update
from invoke.context import Context

ctx = Context()
update(ctx,
       resource_type='model',
       project_dir="PATH",
       select="users",
       log_level="DEBUG")

Solution
We can use a simple regex to parse out any unwanted characters. Something like proposed here: https://stackoverflow.com/a/38662876

PR Should be done soon.

Thank you, @ciklista, for submitting this issue and also for PR #17. I am going to try to reproduce this output on a Windows machine (tried on Mac, but as expected I didn't get an error). The PR looks straightforward and I'm definitely not a Windows expert, but I would like to spend a little bit of time to see I can get a better grasp for the root cause before merging the fix.

Are you able to post the error message here also?

The error I get is (sorry for the german error message)

Traceback (most recent call last):
  File "C:\Users\\code\dbt-invoke\dbt_invoke\testomatiko.py", line 5, in <module>
    update(ctx,
  File "C:\Users\PycharmProjects\dbt-invoke\venv\lib\site-packages\invoke\tasks.py", line 127, in __call__
    result = self.body(*args, **kwargs)
  File "C:\Users\\code\dbt-invoke\dbt_invoke\properties.py", line 101, in update
    common_dbt_kwargs, transformed_ls_results = _initiate_alterations(
  File "C:\Users\code\dbt-invoke\dbt_invoke\properties.py", line 236, in _initiate_alterations
    transformed_ls_results = _transform_ls_results(
  File "C:\Users\code\dbt-invoke\dbt_invoke\properties.py", line 292, in _transform_ls_results
    if Path(ctx.config['project_path'], potential_result_path).exists():
  File "C:\Program Files\Python39\lib\pathlib.py", line 1407, in exists
    self.stat()
  File "C:\Program Files\Python39\lib\pathlib.py", line 1221, in stat
    return self._accessor.stat(self)
OSError: [WinError 123] Die Syntax für den Dateinamen, Verzeichnisnamen oder die Datenträgerbezeichnung ist falsch: 'C:\\[...]\\code\\dbt2\\\x1b[0m'

Process finished with exit code 1

The wrong line is being added to the dbt ls output here and thus treated as a resource with a corresponding yml file.

for line in result_lines:
# Because we set the dbt global arg "--log-format json", if
# line is valid json then it may be an actual result or it
# may be some other output from dbt, like a warning.
try:
line_dict = json.loads(line)
# If line is not valid json, then it should be an actual
# result. This is because even when the "dbt ls" command
# arg "--output" is not set to json, non-result logs will
# still be in json format (due to the dbt global arg
# "--log-format json").
except ValueError:
result_lines_filtered.append(line)
continue
# If 'resource_type' is in line_dict, then this is likely
# an actual result and not something else like a warning.
if 'resource_type' in line_dict:
result_lines_filtered.append(line_dict)
# Else, if 'resource_type' is not in line_dict, this may be
# a warning from dbt, so log it.
else:
logger.warning(f'Extra output from "dbt ls" command: {line}')
return result_lines_filtered

@ciklista, thanks for adding the error message (no need to apologize for knowing multiple languages!). It helped me understand better.

I have an alternate suggestion that I would like to get your opinion on.

I notice the error message is referencing line 290 from this part of properties.py:

for i, potential_result in enumerate(potential_results):
if 'original_file_path' in potential_result:
potential_result_path = potential_result['original_file_path']
# Before dbt version 0.20.0, original_file_path was not
# included in the json response of "dbt ls". For older
# versions of dbt, we need to run "dbt ls" with the
# "--output path" argument in order to retrieve paths
else:
if potential_result_paths is None:
potential_result_paths = _utils.dbt_ls(
ctx,
supported_resource_types=_SUPPORTED_RESOURCE_TYPES,
logger=_LOGGER,
output='path',
**kwargs,
)
assert len(potential_result_paths) == len(
potential_results
), 'Length of results differs from length of result details'
potential_result_path = potential_result_paths[i]
if Path(ctx.config['project_path'], potential_result_path).exists():
results[potential_result_path] = potential_result

There is an error during Path(ctx.config['project_path'], potential_result_path).exists() because the Path not only doesn't exist, but is invalid (because of the ANSI escape sequence issue you identified) as seen near the end of your error message: ... 'C:\\[...]\\code\\dbt2\\\x1b[0m'

What if instead of filtering escape sequences with a regex, we handled all invalid paths, regardless of platform, by checking whether the path is valid before checking whether it exists? If the path is invalid we can trigger a debug log and continue to the next iteration of the loop. Something like:

for i, potential_result in enumerate(potential_results): 
    ...
    project_with_potential_result_path = Path(ctx.config['project_path'], potential_result_path)
    try:
        project_with_potential_result_path.resolve()
    except (OSError, RuntimeError) as e:
        _LOGGER.debug(e)
        continue
    if project_with_potential_result_path.exists(): 
        results[potential_result_path] = potential_result 

Thoughts?

That would definitely solve the issue but I think I prefer making sure that the command output is valid because it
(a) targets the issue more specifically (what if something causes dbt ls to emit a valid output that is not a path, I guess we'd like the function to fail) and
(b) it's closer to the actual problem. If the output gets used in a different context, we'd have to handle the ANSI escape sequence there as well.

You have convinced me with your point (b) that regardless of the behavior of the _transform_ls_results function we should handle the ANSI escape sequence closer to the source, as you originally suggested.