COVESA/dlt-daemon

Trace budgeting mechanism

Opened this issue · 7 comments

I'm interested in mitigating trace-spam risks in my current project, where one domain within a complex automotive system is 'eating up' much of the dlt-daemon logging bandwidth.

I first investigated whether the number of messages that dlt-daemon can process per second could be increased. In my current environment it is ~5000 messages/second, beyond which dlt-daemon drops messages while producing a significant CPU load. My investigation showed that improving this significantly is not possible. I also recall the best practice that dlt-daemon is not intended for heavy tracing of low-level data.

The other option is a per-application and/or per-context-ID trace budgeting mechanism that suppresses trace-spamming processes/contexts.

I've seen the following non-merged PR:
#134

So I'm not the only one who wanted such a feature, but that PR was not merged. Before starting development, I would therefore like to cross-check the following points with the maintainers:

  • Would it be a new feature for the dlt-daemon? Or am I missing something, and does it already exist?
    I'm asking because I've seen the following thread on StackOverflow:
    https://stackoverflow.com/questions/72269739/ubuntu-dlt-tool-trace-load-exceeded-trace-hard-limit-1-messages-discarded
    It describes exactly the feature I want, with warning messages emitted when the trace-spamming domain hits the limit.
    Also, I remember that when I was working on one OEM's project, I had this exact feature in the dlt-daemon, the one described in that StackOverflow thread:

    I got to know that this problem is related to the trace load limits (soft limit and hard limit) mentioned in the Payload column.

    These limits should be set in the configuration file dlt-trace-load.conf for each application that uses the dlt daemon, and they should be defined with the corresponding application ID. Soft_limit: the warning limit; if more data than this is logged, a warning is written into dlt. Hard_limit: if an application surpasses this limit, data will be discarded and a warning will be logged!

    That's why I was almost sure it existed in the official delivery. However, I could not find any information regarding the 'dlt-trace-load.conf' file in this repository or elsewhere. So, this feature might exist in a private OEM's patched version of the dlt-daemon, and someone accidentally posted about it on StackOverflow.

    => Do maintainers know something about this implementation? Can we all get this feature without implementing it from scratch?

  • If not, is it OK to introduce such a feature? Or are there significant objections to having it at all? I'm asking because the previous PR on a similar topic was rejected, so I would like to know beforehand that the maintainers are okay with the idea of such a feature, so that my team's efforts are not thrown away.

  • If you approve of implementing it, would it be OK if I create some architectural diagrams and post them in this thread to align with the maintainers on a possible implementation? I want to implement it properly right away, rather than spending my time and yours on endless reviews.

I am looking forward to getting your feedback! ))

Hello @svlad-90
It is nice of you to raise your concern and your interest in DLT.

Regarding your proposal, IMHO I am okay with the feature. The only thing we need to worry about is making sure that the implementation does not affect the current mechanisms or APIs, does not violate the AUTOSAR standard/specification, does not break any unit tests for current features, etc.
I can do the validation, testing, and checking of your implementation later, in the review phase.
You can go ahead with the diagrams, mechanisms, and PRs, and do not worry at all; we will support you. The last PR was closed because the author's account became inactive, and we cannot proceed if a contributor drops out that way.
As for dlt-trace-load.conf, honestly I have no idea what this file is or what it is for 😀
Maybe you are right that it comes from some commercial version from one of the partners in the alliance.

About this point:

If you approve of implementing it, would it be OK if I create some architectural diagrams and post them in this thread to align with the maintainers on a possible implementation? I want to implement it properly right away, rather than spending my time and yours on endless reviews.

I also have not touched DLT tracing much, just the logging, so it's fine for me to get involved in this topic.
I have no objection, let's work together.
Looking forward to your response!

Hi @minminlittleshrimp,

Thank you very much for your feedback and for being ready to collaborate!

As I'm working on a customer's project in my company, I'll need to plan these activities properly with my management. So, for your information, it might take 1-4 weeks until this task becomes part of a sprint and I'm finally back with the diagrams.

But this feature seems crucial for our customer, who has chosen to use DLT as part of its technology stack, so there is a low chance that we will abandon it. ))

Just a notification that I didn't forget about this issue. I'll definitely get to it soon.

Hey everyone. At Mercedes-Benz, we took this PR and extended it heavily. It now supports configuration of trace load limits via a file.
I'm currently in the process of extending it even more, so it will support limits for context IDs as well. I will create an MR for this as soon as we're done testing internally. @svlad-90 If you can wait ~2 weeks, don't start on it yet :)
Let me know if you can't wait; I could upstream the version we already have. But I would prefer to add the context-ID filtering first.

The configuration is sent from the daemon to the client via an application message.
We have been using this (without the context-ID filtering extension) for a while now, so it's battle-tested.

Our implementation does not break unit tests (in fact, it adds quite a few), and it does not change the API in any way.

Sneak peek of the configuration file

# Configuration file for DLT daemon trace load settings
# This file allows configuration of trace load limits
# If no limit is set here the defaults will be used.
# They are configured via the two defines below in the source code.
# TRACE_LOAD_DAEMON_SOFT_LIMIT
# TRACE_LOAD_DAEMON_HARD_LIMIT
#
# APPID: The application id to limit
# CTXID: The optional context ID to limit. If the context ID is not given, the limit is applied to all contexts of the application.
#        Therefore the best match is used: a context can override the limit of its application, as each line is
#        treated as a separate quota.
# SOFT_LIMIT: The warning limit; if more data than this is logged, a warning is written into dlt
# HARD_LIMIT: If an application surpasses this limit, data will be discarded and a warning will be logged!
# SOFT_LIMIT and HARD_LIMIT are in byte/s
# Warnings will be issued in the interval configured via DLT_USER_HARD_LIMIT_OVER_MSG_INTERVAL
# the default for this is 1s
#
# !!!!
# Note: this file is space separated, and wildcards are not supported
# !!!!
#
# APPID [CTXID] SOFT_LIMIT HARD_LIMIT

# Allow 100000 bytes for all contexts of SYS
SYS 83333 100000

# Allow QSYM to log 100000 bytes, but only on context QSLA
QSYM QSLA 83333 100000

# Allow total 5555 bytes for all contexts on TEST
# But only 100 bytes for context FOO
TEST 2222 5555
TEST FOO 42 100

# BAR BAR gets 84 bytes
# Every other context in BAR gets 42 bytes
BAR 42 42
BAR BAR 84 84
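
To make the semantics above concrete, here is a minimal C sketch of how such per-entry budgets could be enforced: a best-match lookup (a per-context line overrides the application-wide line) followed by a one-second byte counter that warns above the soft limit and drops above the hard limit. This is only an illustration under my own assumptions, not the code behind the upcoming MR; all names (trace_load_entry, trace_load_find, trace_load_check) are hypothetical.

/* Hypothetical sketch of per-app/per-context trace load budgeting.
 * Illustrative only; not the implementation referenced above. */
#include <stdio.h>
#include <string.h>
#include <time.h>

typedef struct {
    char app_id[5];        /* DLT application ID (up to 4 characters) */
    char ctx_id[5];        /* optional context ID; empty string = all contexts */
    unsigned soft_limit;   /* byte/s above which a warning is emitted */
    unsigned hard_limit;   /* byte/s above which messages are discarded */
    unsigned window_bytes; /* bytes logged in the current one-second window */
    time_t window_start;   /* start of the current window */
} trace_load_entry;

typedef enum { TL_OK, TL_WARN, TL_DROP } trace_load_verdict;

/* Best-match lookup: an entry with a matching context ID overrides the
 * application-wide entry, since each config line is a separate quota. */
static trace_load_entry *trace_load_find(trace_load_entry *entries, size_t n,
                                         const char *app, const char *ctx)
{
    trace_load_entry *app_match = NULL;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(entries[i].app_id, app) != 0)
            continue;
        if (entries[i].ctx_id[0] != '\0' && strcmp(entries[i].ctx_id, ctx) == 0)
            return &entries[i];      /* exact app + context match */
        if (entries[i].ctx_id[0] == '\0')
            app_match = &entries[i]; /* application-wide fallback */
    }
    return app_match;
}

/* Account 'size' bytes against the entry and decide what to do with the
 * message. A new one-second window resets the byte counter. */
static trace_load_verdict trace_load_check(trace_load_entry *e, unsigned size)
{
    time_t now = time(NULL);
    if (now != e->window_start) {
        e->window_start = now;
        e->window_bytes = 0;
    }
    e->window_bytes += size;
    if (e->window_bytes > e->hard_limit)
        return TL_DROP; /* discard the message, log a (rate-limited) warning */
    if (e->window_bytes > e->soft_limit)
        return TL_WARN; /* keep the message, but warn about the soft limit */
    return TL_OK;
}

int main(void)
{
    /* Mirrors "TEST 2222 5555" and "TEST FOO 42 100" from the file above. */
    trace_load_entry entries[] = {
        { "TEST", "",    2222, 5555, 0, 0 },
        { "TEST", "FOO",   42,  100, 0, 0 },
    };
    trace_load_entry *e = trace_load_find(entries, 2, "TEST", "FOO");
    /* 150 bytes exceed the 100 byte/s hard limit of TEST/FOO -> TL_DROP (2). */
    printf("verdict: %d\n", e ? (int)trace_load_check(e, 150) : -1);
    return 0;
}

The real implementation presumably also handles the warning interval (DLT_USER_HARD_LIMIT_OVER_MSG_INTERVAL) and the distribution of the entries from the daemon to the clients, which this sketch leaves out.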


Hi @alexmohr, the only thing I can say is Hooray! ))

Having that same approach already tested in Mercedes projects is a huge benefit. I was using it in one of them while fighting trace spam. It will save a lot of time for my current project. So thanks a lot!

I had already started developing a sequence diagram that involved creating a new message type between the dlt-daemon and the dlt library (DLT_USER_MESSAGE_LOG_BUDGET), but that is no longer needed. It also seems your approach uses the same mechanism to distribute the data from the dlt-daemon to all the applications, so there is no need for an architectural proposal.

I can wait 2-4 weeks—there are no issues with that.

Also, let me know if you need any help from my side. I'm interested in this change, so I can assist if you need it.

Again, thank you so much for your proposal! ))

I'm happy to hear that. I'll mention you on the MR once I've created it, so you can help review it, as it will be a rather large change (although many of the changes are tests).
Aside from that, there isn't much to help with right now. I'm currently testing the implementation on different platforms, and once I'm happy with it and it has passed our internal review, upstreaming will follow.