log.offset not ECS compliant?

Question

Opened this issue a year ago · 16 comments

Summary

The field log.offset seems to be a standard when using any of the Elastic Integrations and the file input type. It seems logical that this become a standard ECS field or something equivalent because the line number is very helpful when ingesting from files.

Motivation:

The motivation came from when I was developing my first Elastic integration and was surprised to see that log.offset (which automatically got added to my data ingest) was not found in the ECS. I was using a sample ingested event that contained this field. I ended up removing it from my sample event because I imagine something under the hood in Elastic is creating and mapping this field.

Detailed Design:

Field names

log.offset

Example values for the fields

120

Suggested appropriate datatypes

Type: Long

Any example events that map to the proposed use case(s)

Just about any file ingested with Filebeat will contain this field.

Really, I am just asking the question, why shouldn't this exist in the ECS?

Does log.origin.file.line replace it?

Answer 1 · 2023-07-13T20:52:41.000Z

Does log.origin.file.line replace it?

No, log.origin.file.line is the line number read from the original file. I believe log.offset is a cursor offset in bytes from the beginning of the file.
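To make the distinction concrete, here is a minimal Python sketch that prints both a line number and a byte offset for each line of a small in-memory sample. It assumes the offset points at the first byte of each line; whether a given input records the start or the end of the line is not established in this thread. The helper name `byte_offsets` is hypothetical.

```python
import io


def byte_offsets(data: bytes):
    """Yield (line_number, byte_offset, text) for each line in data.

    Assumption for illustration: the offset is the position of the
    first byte of the line, counted from the beginning of the file.
    """
    offset = 0
    for line_number, raw in enumerate(io.BytesIO(data), start=1):
        yield line_number, offset, raw.decode("utf-8").rstrip("\n")
        # Advance by the raw byte length, including the newline.
        offset += len(raw)


sample = "first\nsecond line\nthird\n".encode("utf-8")
rows = list(byte_offsets(sample))
for row in rows:
    print(row)
```

The line number grows by one per line, while the offset grows by the byte length of each preceding line, so the two values only coincide by accident.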

log.offset was originally in ECS but removed pre-1.0.0: #131. With log.offset populated by the log (and now filestream) input, it may have been considered too low-level to capture in a common schema at the time. There are other input-specific fields not defined in ECS, like input.type.

Answer 2 · 2023-07-13T22:34:38.000Z

Okay, that makes sense. So it is possible that fields that could be used very commonly won't need to be ECS compliant? I thought this could be a slam dunk for a field, but it looks like that road has already been navigated :)

Answer 3 · 2023-07-13T22:36:25.000Z

Also, is using this custom field an issue since log.* is reserved for ECS?

Answer 4 · 2023-07-17T19:34:42.000Z

If there are no plans to address log.offset, I can go ahead and close this issue and move on. I just thought it was an interesting find, but apparently not so much haha

Answer 5 · 2023-07-20T21:12:18.000Z

Also, is using this custom field an issue since log.* is reserved for ECS?

Yes, it's breaking the conventions a bit. There's a handful of legacy fields in places like Beats that are commonly used and predate ECS, but for one reason or another were never formally added to the schema.

If there are no plans to address log.offset, I can go ahead and close this issue and move on.

No plans currently, but we can also leave this discussion open if anyone has thoughts to share around log.offset.

Answer 6 · 2023-07-20T21:26:56.000Z

I think it could be a worthwhile discussion. I will leave it open for a bit until I get a better understanding of the scope of the situation. Thanks for your input!

Answer 7 · 2023-12-17T19:17:05.000Z

@ruflin do you perhaps remember the reasoning behind removing the log.offset type definition, and whether or how we should continue with this?

Answer 8 · 2023-12-19T09:07:20.000Z

The main reason was that for the initial release of ECS we wanted to keep the scope to a minimum. It is easy to add the fields, but really hard to remove them later, see also #131.

That leaves open the question of what we should do with log.offset. For anything shipped to Elastic, we should make sure the mapping is accurate even if it is not part of ECS. @felixbarny Should it be mapped by a default template? @AlexanderWert Is there anything similar in semconv? @cmacknz I assume filestream uses the same field?

Answer 9 · 2023-12-19T11:19:13.000Z

Mapping with a default template sounds reasonable to me. The offset field takes a notable amount of space and is rarely searched on, so disabling indexing would make sense.
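As a rough illustration of what such a mapping might look like, the sketch below builds a template body in Python and prints it as JSON. It is an assumption, not the template Elastic ships: the variable name is hypothetical, and how the body would be installed is left out. It relies only on the standard `index` mapping parameter, set to false for a `long` field.

```python
import json

# Hypothetical template body: maps log.offset as a long and turns off
# indexing for it. Doc values are left at their default.
log_offset_template = {
    "template": {
        "mappings": {
            "properties": {
                "log": {
                    "properties": {
                        "offset": {"type": "long", "index": False}
                    }
                }
            }
        }
    }
}

print(json.dumps(log_offset_template, indent=2))
```

Because doc values are not disabled in this sketch, the field would still be usable for sorting, which matters for the use case raised later in the thread.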

Answer 10 · 2023-12-19T12:32:52.000Z

@AlexanderWert Is there anything similar in semconv?

In semconv there's nothing similar.
There's a log.record.uid but that's different, representing a unique record id.

Answer 11 · 2023-12-19T14:00:07.000Z

The offset field takes a notable amount of space and is rarely searched on, so disabling indexing would make sense.

Agree, also something we are discussing internally. @StephanErb If I remember correctly, you use it for sorting if timestamps are identical? Will you still need it with nano precision or does the field then become obsolete?

Answer 12 · 2023-12-19T16:12:54.000Z

@cmacknz I assume filestream uses the same field?

Yes

Answer 13 · 2023-12-19T16:22:02.000Z

We planned to do the sorting based on the offset, but never got it fully working. As you have said, once nanoseconds come in the offset would not really be needed any longer.
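For context, the tie-breaking idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the setup described above: the sample events are made up, all timestamps share one ISO 8601 format so that string comparison matches chronological order, and all events come from the same file so that offsets are comparable.

```python
# Made-up events from a single file; two share a millisecond timestamp.
events = [
    {"@timestamp": "2023-12-19T16:00:00.001Z", "log": {"offset": 240}, "message": "c"},
    {"@timestamp": "2023-12-19T16:00:00.001Z", "log": {"offset": 120}, "message": "b"},
    {"@timestamp": "2023-12-19T16:00:00.000Z", "log": {"offset": 0}, "message": "a"},
]

# Sort by timestamp first, then use the byte offset to break ties.
ordered = sorted(events, key=lambda e: (e["@timestamp"], e["log"]["offset"]))
messages = [e["message"] for e in ordered]
print(messages)
```

With nanosecond-precision timestamps, identical timestamps become much less likely, which is why the offset would rarely be needed as a tie-breaker.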

Answer 14 · 2023-12-19T20:12:24.000Z

We planned to do the sorting based on the offset, but never got it fully working. As you have said, once nanoseconds come in the offset would not really be needed any longer.

So how do we stop this field from being created, as close as possible to the Filebeat (source) level? I have some hundreds of Agents configured with different policies that contain file log collection/ingestion. What would be the best/recommended approach here?

Answer 15 · 2023-12-20T07:16:18.000Z

Ideally, I think we should make the field "opt-in" on the edge (in filebeat). The problem is that some users will consider this a breaking change, so we need to figure out how to do it gracefully.

I quickly tried doing this manually in Filebeat, and the following seems to work:

processors:
- drop_fields:
    fields: ["log.offset"]
    ignore_missing: true

This drops the field directly on the edge.

Answer 16 · 2023-12-20T07:23:51.000Z

As a follow-up, I have filed elastic/elastic-agent#3934