knix-microfunctions/knix

Workflow log retrieval fails for log content with empty lines

iakkus opened this issue · 5 comments

Retrieving the logs of a workflow fails when the function code logs content containing newlines.
To reproduce:

  1. upload a storage object with newlines (e.g., a text file with multiple lines)
  2. deploy a workflow with a function that retrieves that object and logs it (e.g., print(context.get('mykey')))
  3. try retrieving log files in the GUI
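The function in step 2 can be sketched as follows (the handler signature matches the examples later in this thread; FakeContext is only a hypothetical stand-in for the KNIX context object, added so the sketch runs locally):

```python
# Hypothetical repro function for step 2: reads the object stored in
# step 1 and prints it, so the multi-line content ends up in the logs.
def handle(event, context):
    content = context.get('mykey')   # the multi-line object from step 1
    print(content)                   # one print call; content contains '\n\n'
    return event

# Minimal stand-in for the KNIX context, just to exercise the handler
# locally; the real context is provided by the platform.
class FakeContext:
    def __init__(self, store):
        self._store = store

    def get(self, key):
        return self._store.get(key)
```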

The management service retrieveAllWorkflowLogs throws an exception:
File "/opt/mfn/workflow/states/retrieveAllWorkflowLogs/retrieveAllWorkflowLogs/retrieveAllWorkflowLogs.py", line 144, in get_workflow_log
    hit_str = '[%s] [%s] [%s] [%s] %s' % (source['asctime'], source['loglevel'], source['uuid'], source['function'], source['message'])
KeyError: 'message'

docker logs <wid> shows the intended log lines.

Perhaps the fluentbit process doesn't handle such log lines as it should?

The following function shows the log output and does not throw an error:

def handle(event, context):
    print("45\n89")
    print(3/0)
    return event

Produces logs:
[2020-07-13 14:03:09.577] [INFO] [9527a53ec51111eab60d5e071ba3e3fa] [hello] 45
[2020-07-13 14:03:09.578] [INFO] [9527a53ec51111eab60d5e071ba3e3fa] [hello] 89
[2020-07-13 14:03:09.578] [ERROR] [9527a53ec51111eab60d5e071ba3e3fa] [hello] User code exception: 0
division by zero
Traceback (most recent call last):
  File "/opt/mfn/FunctionWorker/python/FunctionWorker.py", line 428, in _fork_and_handle_message
    function_output = self._state_utils.exec_function_catch_retry(self._function_runtime, exec_arguments, sapi)
  File "/opt/mfn/FunctionWorker/python/StateUtils.py", line 385, in exec_function_catch_retry
    ret_value = self._exec_function(runtime, exec_arguments, sapi)
  File "/opt/mfn/FunctionWorker/python/StateUtils.py", line 260, in _exec_function
    function_output = func(args, sapi)
  File "/opt/mfn/workflow/states/hello/hello/hello.py", line 5, in handle
    print(3/0)
ZeroDivisionError: division by zero

So fluentbit can handle newlines. Maybe this specific case needs more investigation.

Is it related to empty lines? (The message key might get optimized away?) Just a wild guess :)

Yeah, maybe.

def handle(event, context):
    print("45\n89")
    return event

works.

def handle(event, context):
    print("45\n\n89")

    return event

fails.

The difference is that the printed content contains \n\n.
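A quick way to see why \n\n matters: splitting the printed content on newlines, as a line-based log writer would, turns the blank line into an empty string, and an empty record may then be dropped or mishandled downstream. A sketch (plain Python, not KNIX code):

```python
# Splitting the two payloads the way a line-based log writer would:
ok = "45\n89".split("\n")     # no empty entries
bad = "45\n\n89".split("\n")  # the blank line becomes an empty string
print(ok)   # ['45', '89']
print(bad)  # ['45', '', '89']
```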

The retrieveAllWorkflowLogs function in the management service now first checks whether the 'message' field exists before using it.
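A minimal sketch of such a guard (the field names come from the traceback above; the helper function itself is illustrative, not the actual management-service code):

```python
def format_hit(source):
    # Skip entries whose 'message' field is missing instead of
    # raising KeyError as the old code did.
    if 'message' not in source:
        return None
    return '[%s] [%s] [%s] [%s] %s' % (source['asctime'], source['loglevel'],
                                       source['uuid'], source['function'],
                                       source['message'])
```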

When the content consists of multiple lines (e.g., a long text file), the lines might still come out of order; that issue, however, is due to the timestamp field. It is unclear whether this happens because Elasticsearch indexes each log line at the time it receives it, and whether a timestamp other than source['asctime'] could be used to determine the order.
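The ordering concern can be illustrated with a sketch: when two records share the same low-precision asctime, sorting on it cannot distinguish them, while a finer-grained numeric timestamp (the field names and values below are hypothetical) does:

```python
# Hypothetical records: identical millisecond asctime, distinct
# microsecond-resolution timestamps assigned at indexing time.
records = [
    {'asctime': '2020-07-13 14:03:09.577', 'timestamp': 1594648989577002, 'message': 'second'},
    {'asctime': '2020-07-13 14:03:09.577', 'timestamp': 1594648989577001, 'message': 'first'},
]

# Sorting on asctime alone leaves the two records in their arrival order...
by_asctime = sorted(records, key=lambda r: r['asctime'])
# ...while the finer-grained timestamp restores the intended order.
by_timestamp = sorted(records, key=lambda r: r['timestamp'])
print([r['message'] for r in by_timestamp])  # ['first', 'second']
```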

A few observations:

  1. Empty lines consisting of just a newline (i.e., '\n') don't seem to be indexed by Elasticsearch and/or sent by fluentbit. Possibly related to that, multiple consecutive newline characters in one logging statement get collapsed into one.
  2. print() was splitting the lines and logging each line separately, as defined in MicroFunctionsLogWriter. In contrast, MicroFunctionsAPI.log() logs the entire content in one go. I modified print() to also log the content in one go. The reasoning is that, at the end of the day, there is a single print() call, and whether the content contains multiple lines should not matter.
  3. retrieveAllWorkflowLogs now returns the output log with source['timestamp'] from the Elasticsearch result, which the GUI uses to sort the statements. This helps in cases where source['asctime'] did not have enough precision. With the change in 2), this probably won't have much effect; however, I think it is better this way. The progress log currently stays as it was before.
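Observation 2 boils down to the difference between a per-line writer and a one-go writer; a sketch (function names here are illustrative, not the actual MicroFunctionsLogWriter API):

```python
# Per-line writer: one print("45\n\n89") becomes three records, one of
# them empty, and the empty record may later be dropped by fluentbit
# or Elasticsearch.
def log_per_line(records, content):
    for line in content.split('\n'):
        records.append(line)

# One-go writer: the whole content stays in a single record, so the
# empty line survives as part of the message.
def log_one_go(records, content):
    records.append(content)
```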