Incorrect results from parse_etld
DylanRJohnston opened this issue · 3 comments
DylanRJohnston commented
When invoking parse_etld for the domain test.servicebus.windows.net, VRL gives the following result:
{
"etld":"servicebus.windows.net",
"etld_plus":"test.servicebus.windows.net",
"known_suffix":true
}
The response I expected is:
{
"etld":"net",
"etld_plus":"windows.net",
"known_suffix":true
}
It seems to incorrectly identify the eTLD as servicebus.windows.net instead of net.
DylanRJohnston commented
Elastic correctly identifies the eTLD.
// POST /_ingest/pipeline/_simulate
{
"pipeline": {
"description": "eTLD",
"processors": [
{
"registered_domain": {
"field": "message",
"target_field": "url"
}
}
]
},
"docs": [
{
"_source": {
"message": "test.servicebus.windows.net"
}
}
]
}
{
"docs": [
{
"doc": {
"_index": "_index",
"_version": "-3",
"_id": "_id",
"_source": {
"message": "test.servicebus.windows.net",
"url": {
"subdomain": "test.servicebus",
"registered_domain": "windows.net",
"top_level_domain": "net",
"domain": "test.servicebus.windows.net"
}
},
"_ingest": {
"timestamp": "2024-07-03T02:33:31.737388501Z"
}
}
}
]
}
DylanRJohnston commented
Actually, it looks like servicebus.windows.net appears in the public suffix list: https://publicsuffix.org/list/public_suffix_list.dat. So perhaps the issue is on the Elastic side 🤔
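To illustrate why a listed entry changes the result, here is a minimal Python sketch of a longest-match suffix lookup. It is not VRL's or Elastic's implementation: the function name lookup_etld and the two-entry rule sets are hypothetical, and the sketch ignores the wildcard and exception rules that the real list also contains.

```python
# Hand-picked subset of public suffix rules (assumption: the real list is
# far larger; wildcard "*." and exception "!" rules are ignored here).
RULES_WITH_PRIVATE = {"net", "servicebus.windows.net"}
RULES_ICANN_ONLY = {"net"}


def lookup_etld(domain, rules):
    """Return the longest suffix of `domain` found in `rules`, plus one extra label."""
    labels = domain.lower().rstrip(".").split(".")
    # Walk from the full name towards the last label, so the first hit
    # is the longest matching suffix.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in rules:
            start = max(i - 1, 0)  # eTLD plus one label, if one exists
            return {
                "etld": candidate,
                "etld_plus": ".".join(labels[start:]),
                "known_suffix": True,
            }
    # Nothing matched: treat the last label as the suffix.
    return {
        "etld": labels[-1],
        "etld_plus": ".".join(labels[-2:]),
        "known_suffix": False,
    }


print(lookup_etld("test.servicebus.windows.net", RULES_WITH_PRIVATE))
print(lookup_etld("test.servicebus.windows.net", RULES_ICANN_ONLY))
```

With servicebus.windows.net in the rule set, the first call reproduces the VRL output from the issue description; with only net in the rule set, the second call yields net and windows.net, which matches what Elastic returned. So the difference may simply come down to whether that entry of the list is taken into account.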
DylanRJohnston commented
Actually, after looking into this more carefully, I think Elastic is the one giving the incorrect response here, if I understand the semantics of the registered_domain processor correctly.