duckduckgo/tracker-radar

Meaning of subdomains field

mainzelM opened this issue · 4 comments

I'm wondering about the meaning of the "subdomains" field within a resource. It seems that the "subdomains" field is either empty (in which case the "rule" field corresponds to the actual URL of the resource) or contains a list of non-empty names, in which case the "rule" field must be prefixed with these names to get the actual URLs. However, in the latter case, is there any information available from which I can conclude whether the "rule" field without any subdomain has also been detected as an actual URL?

That is, if there is a resource

{
    "rule": "foo\\.net\\/bar",
    "subdomains": [
        "www"
    ]
}

how can I conclude whether "foo.net/bar" has been detected (in addition to "www.foo.net/bar")?
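To make the ambiguity concrete, here is a minimal Python sketch of how a consumer might expand such a resource. It assumes the field semantics described above, and that "rule" only uses backslash escaping (real rules are regular expressions and may contain more than that). The helper name `expand_resource` is hypothetical and not part of Tracker Radar.

```python
import json
import re


def expand_resource(resource):
    """Return (seen, ambiguous) URL lists for one resource entry."""
    # Undo the regex escaping in "rule", e.g. foo\.net\/bar -> foo.net/bar.
    # Assumption: the rule contains no other regex syntax.
    bare = re.sub(r"\\(.)", r"\1", resource["rule"])
    subdomains = resource.get("subdomains", [])
    if not subdomains:
        # Empty list: the rule itself is the URL that was seen.
        return [bare], []
    # Non-empty list: the prefixed URLs were seen; the bare one is unknown.
    seen = [f"{sub}.{bare}" for sub in subdomains]
    return seen, [bare]


resource = json.loads(r'''
{
    "rule": "foo\\.net\\/bar",
    "subdomains": ["www"]
}
''')
seen, ambiguous = expand_resource(resource)
print(seen)       # ['www.foo.net/bar']
print(ambiguous)  # ['foo.net/bar']
```

The second list is the open question: nothing in the entry says whether "foo.net/bar" was observed.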

Thanks for bringing this to our attention @mainzelM! Unfortunately, it turns out that this notation is ambiguous (your example can mean either that the resource was seen on "www.foo.net/bar" only, or on both "www.foo.net/bar" and "foo.net/bar"). We consider it a bug and will fix it either by introducing an empty subdomain (<none> or just "") or an additional property. That being said, I don't know when exactly we will be able to get to it.
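For illustration only, this is roughly how a consumer could read the data if the empty-subdomain option were chosen. It is a sketch under that assumption: no such format exists yet, the "" marker is just one of the options mentioned above, and `expand_with_bare_marker` is a hypothetical name.

```python
import re


def expand_with_bare_marker(resource):
    """Expand a resource, treating "" in "subdomains" as the bare domain."""
    # Undo the regex escaping in "rule" (assumes backslash escapes only).
    bare = re.sub(r"\\(.)", r"\1", resource["rule"])
    subdomains = resource.get("subdomains", [])
    if not subdomains:
        # Same reading as today: the rule itself is the URL.
        return [bare]
    # Hypothetical: an empty string means "seen without any subdomain".
    return [f"{sub}.{bare}" if sub else bare for sub in subdomains]


only_www = expand_with_bare_marker(
    {"rule": "foo\\.net\\/bar", "subdomains": ["www"]}
)
www_and_bare = expand_with_bare_marker(
    {"rule": "foo\\.net\\/bar", "subdomains": ["", "www"]}
)
print(only_www)      # ['www.foo.net/bar']
print(www_and_bare)  # ['foo.net/bar', 'www.foo.net/bar']
```

With an explicit marker the two cases from the example become distinguishable, which is the point of the proposed fix.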

Thanks for the feedback and the clarification, @kdzwinel. I'm glad to hear that you plan to work on this! As I'm deriving blocker lists from the information you provide (https://github.com/mainzelM/ddg-tr-as-easylist), I'm interested in making the rules as concise as possible.

I also came across another, related topic: if a rule includes CNAME information, e.g.

{
    "rule": "foo\\.net\\/bar",
    "subdomains": [
        "tracker"
    ],
    "cnames": [
        {
            "original": "baz.com",
            "resolved": "tracker.foo.net"
        }
    ]
}

I'd be interested to know whether "tracker.foo.net/bar" was seen in addition to "baz.com". Currently, I cannot derive this from the data above, right?
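A small Python sketch of what can be read from such an entry, assuming every "cnames" item carries "original" and "resolved" keys as in the example. The helper name `hosts_to_consider` is hypothetical.

```python
def hosts_to_consider(resource):
    """Split hostnames into CNAME aliases and their resolved targets."""
    cnames = resource.get("cnames", [])
    # Hostnames that appeared in requests and were resolved via CNAME.
    aliases = sorted({c["original"] for c in cnames})
    # Hostnames the aliases point to; whether they were also requested
    # directly cannot be read from the entry itself.
    targets = sorted({c["resolved"] for c in cnames})
    return aliases, targets


resource = {
    "rule": "foo\\.net\\/bar",
    "subdomains": ["tracker"],
    "cnames": [{"original": "baz.com", "resolved": "tracker.foo.net"}],
}
aliases, targets = hosts_to_consider(resource)
print(aliases)  # ['baz.com']
print(targets)  # ['tracker.foo.net']
```

Since the entry does not say whether the targets were requested directly, a blocklist generator that wants to be safe would probably have to cover both lists.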

Hey @mainzelM, sorry for the late response.

I'd be interested to know whether "tracker.foo.net/bar" was seen in addition to "baz.com". Currently, I cannot derive this from the data above, right?

I believe that you are right: you can't tell that at the moment. We may release raw crawl data at some point, but I don't have an ETA.

https://github.com/mainzelM/ddg-tr-as-easylist

That's awesome to see 👍

Can you please take the bug off? Thank you.