vectordotdev/vrl

Incorrect results from parse_etld

DylanRJohnston opened this issue · 3 comments

When invoking parse_etld for the domain test.servicebus.windows.net, VRL gives the following result:

Playground Link

{
  "etld":"servicebus.windows.net",
  "etld_plus":"test.servicebus.windows.net",
  "known_suffix":true
}

The correct response should be:

{
  "etld":"net",
  "etld_plus":"windows.net",
  "known_suffix":true
}

It seems to incorrectly identify the eTLD as servicebus.windows.net instead of net.

Elastic correctly identifies the eTLD.

// POST /_ingest/pipeline/_simulate
{
    "pipeline": {
        "description": "eTLD",
        "processors": [
            {
                "registered_domain": {
                    "field": "message",
                    "target_field": "url"
                }
            }
        ]
    },
    "docs": [
        {
            "_source": {
                "message": "test.servicebus.windows.net"
            }
        }
    ]
}

// Response
{
    "docs": [
        {
            "doc": {
                "_index": "_index",
                "_version": "-3",
                "_id": "_id",
                "_source": {
                    "message": "test.servicebus.windows.net",
                    "url": {
                        "subdomain": "test.servicebus",
                        "registered_domain": "windows.net",
                        "top_level_domain": "net",
                        "domain": "test.servicebus.windows.net"
                    }
                },
                "_ingest": {
                    "timestamp": "2024-07-03T02:33:31.737388501Z"
                }
            }
        }
    ]
}

Actually, it looks like servicebus.windows.net appears in the public suffix list (https://publicsuffix.org/list/public_suffix_list.dat), so perhaps the issue is on the Elastic side 🤔

Actually, after looking into this more carefully, I think Elastic is the one giving the incorrect response here, if I understand the semantics of the registered_domain processor correctly.
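
For anyone else who lands here, this is a rough sketch of why the presence of servicebus.windows.net in the list changes the answer. It is not VRL's or Elastic's actual implementation. The function name `split_etld` and the two tiny rule sets are made up for illustration, and the real public suffix list also has wildcard and exception rules that this ignores. It only shows plain longest-suffix matching:

```python
def split_etld(domain, suffix_rules):
    """Return (etld, etld_plus_one) using longest-suffix matching.

    Hypothetical helper for illustration only: handles exact-match rules,
    not the wildcard (*.) or exception (!) rules of the real list.
    """
    labels = domain.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so the first hit is the longest matching rule.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffix_rules:
            # eTLD+1 is the suffix plus one more label to the left, if any.
            etld_plus = ".".join(labels[max(i - 1, 0):])
            return candidate, etld_plus
    # No rule matched: treat the last label as the suffix.
    return labels[-1], ".".join(labels[-2:])


# Assumed excerpt that includes the servicebus.windows.net entry.
rules_with_entry = {"net", "servicebus.windows.net"}
# Same excerpt without that entry.
rules_without_entry = {"net"}

domain = "test.servicebus.windows.net"
print(split_etld(domain, rules_with_entry))
print(split_etld(domain, rules_without_entry))
```

With the servicebus.windows.net rule present, the sketch returns the same etld and etld_plus that VRL reports. Without it, the sketch returns net and windows.net, which is what I originally expected.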