elastic/ecs

log.offset not ECS compliant?

Opened this issue · 16 comments

Summary

The field log.offset seems to be a standard when using any of the Elastic Integrations and the file input type. It seems logical that this become a standard ECS field or something equivalent because the line number is very helpful when ingesting from files.

Motivation:

The motivation came from when I was developing my first Elastic integration and was surprised to see that log.offset (which automatically got added to my data ingest) was not found in the ECS. I was using a sample ingested event that contained this field. I ended up removing it from my sample event because I imagine something under the hood in Elastic is creating and mapping this field.

Detailed Design:

  • Field names

log.offset

  • Example values for the fields

120

  • Suggested appropriate datatypes

Type: Long

  • Any example events that map to the proposed use case(s)

About any file that is ingested with filebeat will contain this field.

Really, I am just asking the question, why shouldn't this exist in the ECS?

Does log.origin.file.line replace it?

Does log.origin.file.line replace it?

No, log.origin.file.line is the line number read from the original file. I believe log.offset is a cursor offset in bytes from the beginning of the file.

log.offset was originally in ECS but removed pre-1.0.0: #131. With log.offset populated by the log (and now filestream) input, it may have been considered too low-level to capture in a common schema at the time. There are other input-specific fields not defined in ECS, like input.type.

Okay, that makes sense. So it is possible that fields that could be used very commonly won't need to be ECS compliant? I thought this could be a slam fun for a field but looks like that road has already been navigated :)

Also, is using this custom field an issue since log.* is reserved for ECS?

If there are no plans to address the log.offset I can go ahead to close this Issue and move on. I just thought it was an interesting find, but apparently not so much haha

Also, is using this custom field an issue since log.* is reserved for ECS?

Yes, it's breaking the conventions a bit. There's a handful of legacy fields in places, like Beats, that are commonly used, predate ECS, but for one reason or another never formally added to the schema.

If there are no plans to address the log.offset I can go ahead to close this Issue and move on.

No plans currently, but we can also leave this discussion open if anyone has thoughts to share around log.offset.

I think it could be a worthwhile discussion. I will leave it open for a bit until I get a better understanding of the scope of the situation. Thanks for your input!

@ruflin do you remember perhaps what was the reasoning behind removing log.offset type deninition and how should we continue or not with this ?

The main reason was that for the initial release of ECS we wanted to keep the scope to a minimum. It is easy to add the fields, but really hard to remove them later, see also #131.

It leaves open the question, what should we do with log.offset. For anything shipped to Elastic, we should make sure the mapping is accurate even if it is not part of ECS. @felixbarny Should it be mapped by a default template? @AlexanderWert Is there anything similar in semconv? @cmacknz I assume filestream uses the same field?

Mapping with a default template sounds reasonable to me. The offset fields takes notable amount of space and is rarely searched on, so disabling indexing would make sense.

@AlexanderWert Is there anything similar in semconv?

In semconv there's there's nothing similar.
There's a log.record.uid but that's different, representing a unique record id.

The offset fields takes notable amount of space and is rarely searched on, so disabling indexing would make sense.

Agree, also something we are discussing internally. @StephanErb If I remember correctly, you use it for sorting if timestamps are identical? Will you still need it with nano precision or does the field then become obsolete?

@cmacknz I assume filestream uses the same field?

Yes

We planned to do the sorting based on the offset, but never got it fully working. As you have said, once nanoseconds come in the offset would not really be needed any longer.

We planned to do the sorting based on the offset, but never got it fully working. As you have said, once nanoseconds come in the offset would not really be needed any longer.

So how do we stop this field from being created, as close as possible to the filebeat(source) level? I have some hundreds of Agents configured with different policies that contain file log collection/ingestion. What would be the best/recommend approach here?

Ideally, I think we should make the field "opt-in" on the edge (in filebeat). The problem is that some users will consider this a breaking change, so we need to figure out how to do it gracefully.

I did a quick try with filebeat on how to do it manually and the following seems to work:

processors:
- drop_fields:
    fields: ["log.offset"]
    ignore_missing: true

This drops the field directly on the edge.

As a follow up I have filed elastic/elastic-agent#3934