CybercentreCanada/assemblyline

Suricata service can be stuck for hours if suricata didn't start

kam193 opened this issue · 5 comments

Describe the bug
Recently, I've recognized a few times that Suricata service stopped processing files. After analysing what's going on, I've found that Suricata is not running, and the service seems to have no timeout for the initialization. I'm not sure why Suricata isn't running, I haven't found any tracks in logs, neither from the service nor the Suricata inside the container. Restarting container is enough to repair it.

The behaviour of the service looks a little unhealthy as it's trying to reach Suricata in an infinite loop, throwing RecoverableError what prevents scaler from stepping in and recreating the service.

To Reproduce
Steps to reproduce the behavior:

Unfortunately, I don't know an easy way. I assume killing the Suricata process or modifying the service not to start Suricata at all could reproduce the behaviour.

Expected behavior
I would like the service to have a timeout / limit for Suricata to start, and just give up throwing a normal error. Using RecoverableError for the time of the initialization is perfectly fine, but should be stopped after some time.

Screenshots
If applicable, add screenshots to help explain your problem.

Environment (please complete the following information if pertinent):

  • Assemblyline Version: 4.5.0.18, Suricata service 4.5.0.5

Additional context
The service container has logs like below. Note the time - between the start and last sample are two hours of throwing just RecoverableError.

{"@timestamp": "2024-04-22 14:30:04,319", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Launching Suricata: /usr/local/bin/suricata -vvvv -c /etc/suricata/suricata.yaml --unix-socket=/var/run/suricata/suricata.socket --pidfile /var/run/suricata/suricata.pid --set logging.outputs.1.file.filename=/var/log/suricata/suricata.log"}
{"@timestamp": "2024-04-22 14:30:04,320", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Suricata not started yet: [Errno 2] No such file or directory"}
...
{"@timestamp": "2024-04-22 14:30:18,322", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Ruleset 0: 71348 rules loaded"}
{"@timestamp": "2024-04-22 14:30:18,322", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "WARNING", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Ruleset 0: 3458 rules failed to load.This can be due to duplication of rules among muliple rulesets being loaded."}
{"@timestamp": "2024-04-22 14:30:18,323", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Suricata started with service version: 4.5.0.5"}
{"@timestamp": "2024-04-22 14:30:18,353", "event": { "module": "assemblyline", "dataset": "assemblyline.odm" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "WARNING", "logger": "assemblyline.odm" }, "process": { "pid": "1" }, "message": "The following parameters were ignored from object 'Service': monitored_keys"}
{"@timestamp": "2024-04-22 14:30:18,378", "event": { "module": "assemblyline", "dataset": "assemblyline.dispatching.client" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.dispatching.client" }, "process": { "pid": "1" }, "message": "[3BH0jBEinXRrBLTJlPFgnj/704e5e5b3234433c01fcfd1b20a306e77e985038120492dc53965c3edd38a4ea] Suricata:f2680f09c0a4 task found"}
{"@timestamp": "2024-04-22 14:30:18,396", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "[3BH0jBEinXRrBLTJlPFgnj] New task received"}
{"@timestamp": "2024-04-22 14:30:18,396", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "[3BH0jBEinXRrBLTJlPFgnj] Downloading file: 704e5e5b3234433c01fcfd1b20a306e77e985038120492dc53965c3edd38a4ea"}
{"@timestamp": "2024-04-22 14:30:18,409", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "[3BH0jBEinXRrBLTJlPFgnj] Starting task for file: 704e5e5b3234433c01fcfd1b20a306e77e985038120492dc53965c3edd38a4ea (network/tcpdump)"}
{"@timestamp": "2024-04-22 14:31:19,037", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Recoverable Service Error (3BH0jBEinXRrBLTJlPFgnj/704e5e5b3234433c01fcfd1b20a306e77e985038120492dc53965c3edd38a4ea) [Errno 104] Connection reset by peer:   File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_v4_service/common/base.py\", line 181, in handle_task\n    self.execute(request)\n  File \"/opt/al_service/suricata_/suricata_.py\", line 512, in execute\n    raise RecoverableError(e)\nRecoverableError: [Errno 104] Connection reset by peer"}
{"@timestamp": "2024-04-22 14:31:19,038", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "[3BH0jBEinXRrBLTJlPFgnj] Saving error to: /tmp/3BH0jBEinXRrBLTJlPFgnj_704e5e5b3234433c01fcfd1b20a306e77e985038120492dc53965c3edd38a4ea_error.json"}
{"@timestamp": "2024-04-22 14:31:19,042", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "There are no new signatures. (1713789546.8030632 >= 1713789546.8030632)"}
{"@timestamp": "2024-04-22 14:31:19,042", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "[3BH0jBEinXRrBLTJlPFgnj] Task completed with errors"}
{"@timestamp": "2024-04-22 14:31:19,042", "event": { "module": "assemblyline", "dataset": "assemblyline.tasking_client" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.tasking_client" }, "process": { "pid": "1" }, "message": "[3BH0jBEinXRrBLTJlPFgnj] f2680f09c0a4 - Suricata failed to complete task "}
...
{"@timestamp": "2024-04-22 14:33:13,160", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Suricata not started yet: [Errno 111] Connection refused"}
{"@timestamp": "2024-04-22 14:33:23,160", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Suricata not started yet: [Errno 111] Connection refused"}
{"@timestamp": "2024-04-22 14:33:23,161", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Recoverable Service Error (3BH0jBEinXRrBLTJlPFgnj/1ffd5e0894eef1ef357e5554ada335827398568b8287331221dff8abe4c42622) RetryError[Attempts: 15, Value: False]:   File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_v4_service/common/base.py\", line 181, in handle_task\n    self.execute(request)\n  File \"/opt/al_service/suricata_/suricata_.py\", line 476, in execute\n    self.start_suricata_if_necessary()\n  File \"/opt/al_service/suricata_/suricata_.py\", line 145, in start_suricata_if_necessary\n    raise RecoverableError(e)\nRecoverableError: RetryError[Attempts: 15, Value: False]"}
...
{"@timestamp": "2024-04-22 16:25:03,645", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Suricata not started yet: [Errno 111] Connection refused"}
{"@timestamp": "2024-04-22 16:25:03,645", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Recoverable Service Error (77rC62QgvSPJeHvVdIp6dY/3382b9a14870049b854563106c33caa3da560b215a60e175e0aedfeff27a562b) RetryError[Attempts: 15, Value: False]:   File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_v4_service/common/base.py\", line 181, in handle_task\n    self.execute(request)\n  File \"/opt/al_service/suricata_/suricata_.py\", line 476, in execute\n    self.start_suricata_if_necessary()\n  File \"/opt/al_service/suricata_/suricata_.py\", line 145, in start_suricata_if_necessary\n    raise RecoverableError(e)\nRecoverableError: RetryError[Attempts: 15, Value: False]"}
{"@timestamp": "2024-04-22 16:25:03,646", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "[77rC62QgvSPJeHvVdIp6dY] Saving error to: /tmp/77rC62QgvSPJeHvVdIp6dY_3382b9a14870049b854563106c33caa3da560b215a60e175e0aedfeff27a562b_error.json"}

The /var/log/suricata/suricata.log is finished at 14:30 with a fail of loading some signatures, what isn't a reason to stop Suricata, and looks the same as when everything works.

Can you confirm if this is still happening on more recent versions of the Suricata service?

I've just disabled my workaround (script looking for stuck Suricata services every 30 minutes, it was impossible to work without it) - I'll come back in a few days with results.

I'm confirming this is still an issue - with Suricata 4.5.0.10 and AL v26

Want to try testing with the latest stable?

The service was updated to use Suricata 8.0-dev and a flag was added to the command to run as daemon which I'm hoping should resolve the issue mentioned (unsure if that flag was available in previous versions).

So... I'm not really able to test it, as on my server the newest Suricata (service v16 or even v15) doesn't even start:

{"@timestamp": "2024-06-06 20:59:55,903", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "6879c90012be" }, "log": { "level": "ERROR", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Exiting:\nTraceback (most recent call last):\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_core/server_base.py\", line 100, in run\n    self.try_run()\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_v4_service/run_privileged_service.py\", line 148, in try_run\n    self.service_tool_version = self.service.get_tool_version()\n                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/opt/al_service/suricata_/suricata_.py\", line 114, in get_tool_version\n    return f\"{self.get_suricata_version()}.r{self.rules_hash}\"\n              ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/opt/al_service/suricata_/suricata_.py\", line 107, in get_suricata_version\n    return safe_str(subprocess.check_output([\"suricata\", \"-V\"]).strip().replace(b\"This is Suricata version \", b\"\"))\n                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/local/lib/python3.11/subprocess.py\", line 466, in check_output\n    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/local/lib/python3.11/subprocess.py\", line 571, in run\n    raise CalledProcessError(retcode, process.args,\nsubprocess.CalledProcessError: Command '['suricata', '-V']' died with <Signals.SIGILL: 4>.\n"}

What is caused by:

kernel: traps: suricata[2946516] trap invalid opcode ip:56509c8df61c sp:7ffd368f99e0 error:0 in suricata[56509c8c5000+5e1000]

Indicating incompatible binary instruction... Strange.

To be honest, even when I believe that looking for the root cause is important, my suggestion is to rather prepare for cases like surprisingly dead suricata process. When it's not the main process in the docker container, it could be killed for any reason (even manually - good for testing), and the container will live further. I'd add a check in the error handling to see if the process is still running or exited.