
Possible workflow (WIP) for identifying domain owners and privacy policies

billfitzgerald opened this issue · 4 comments

tl;dr: I might be able to jumpstart identifying owners for domains that are currently unaffiliated, and I have a decent method of identifying likely candidates for privacy policies in an automated way.

This is a WIP; I'm sharing details here to see if you're interested in this work, and to make sure it aligns with the project and would be useful.

Details:

Parse all records in the tracker-radar/domains/US directory. Identify all domains that do not have an owner.
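The scan itself is straightforward. A minimal sketch, assuming the standard tracker-radar layout where each file in `domains/US` is one JSON record with an optional `owner` object:

```python
import json
from pathlib import Path

DOMAINS_DIR = Path("tracker-radar/domains/US")

unowned = []
for record_file in sorted(DOMAINS_DIR.glob("*.json")):
    record = json.loads(record_file.read_text(encoding="utf-8"))
    # Treat a missing or empty "owner" object as unaffiliated.
    if not record.get("owner"):
        unowned.append(record)

print(f"{len(unowned)} records have no owner")
```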

For every domain that does not have an owner, get up to 5 subdomains. If a record doesn't have a subdomain, use the base domain and suffix.
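Building the host list is a few lines. A sketch, assuming each record has a `domain` field and a `subdomains` list of labels (e.g. `["www", "cdn"]`):

```python
def candidate_hosts(record, limit=5):
    """Return up to `limit` subdomain hosts; fall back to the bare domain."""
    base = record["domain"]
    subs = record.get("subdomains") or []
    if not subs:
        # No subdomains recorded: use the base domain and suffix as-is.
        return [base]
    return [f"{sub}.{base}" for sub in subs[:limit]]
```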

Send a headers-only request to every site in the list at http and https (only headers - no need to be rude and hit the full site). Record the response codes. This requires calls to 33,798 locations (16,899 hosts, each contacted via both http and https). This step is complete; I'm happy to share this dataset if you're interested.
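For anyone who wants to reproduce this, the probe is roughly the following (a sketch, not my exact script: I'm treating "only headers" as a HEAD request, and the `requests` library and function name are my choices):

```python
import requests

def probe(host, timeout=10):
    """HEAD each scheme; record the status code or the failure type."""
    results = {}
    for scheme in ("http", "https"):
        try:
            # Headers only - no need to pull the full page.
            resp = requests.head(f"{scheme}://{host}/", timeout=timeout,
                                 allow_redirects=False)
            results[scheme] = resp.status_code
        except requests.RequestException as exc:
            results[scheme] = type(exc).__name__
    return results
```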

This gives us a range of useful information, including:

  1. Sites that support http and https
  2. Response codes at each subdomain (2xx, 3xx, 4xx, 5xx)

By cross-referencing which sites support http/https with the response codes at each location, we can infer a range of things (that's a longer and separate conversation). For this specific use case, we can use the protocol and response code results to flesh out ownership of these domains.

I'm starting with all domains that have at least one subdomain that responded with a 2xx response code and supported https (4,676 unique domains). I'm loading each domain via Selenium, passing the results to BeautifulSoup, and then focusing on the contents of all `a` tags, specifically the text displayed in the tag and the linked url. If either the text or the url contains "privacy", "terms", "legal", etc., I store the url.
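The extraction looks roughly like this (a sketch: the keyword tuple is truncated per the "etc.", and the function name is mine):

```python
from bs4 import BeautifulSoup
from selenium import webdriver

KEYWORDS = ("privacy", "terms", "legal")  # truncated; the actual list is longer

def policy_links(driver, url):
    """Load `url`, return (link text, href) pairs that look policy-ish."""
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    hits = []
    for a in soup.find_all("a", href=True):
        text = a.get_text(strip=True)
        haystack = f"{text} {a['href']}".lower()
        if any(keyword in haystack for keyword in KEYWORDS):
            hits.append((text, a["href"]))
    return hits

driver = webdriver.Firefox()  # any Selenium driver works here
```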

Using this process, I'm generating an additional csv file that includes the fields below (a writer sketch follows the list):

  • starting_url
  • current_url (if the site redirects, the eventual location)
  • page title (often contains the company name)
  • relevant urls of policies
  • relevant text associated with those urls
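Writing that out is a plain `csv.DictWriter`; the column names and filename below are my concrete spelling of the fields above:

```python
import csv

FIELDS = ["starting_url", "current_url", "page_title",
          "policy_urls", "policy_text"]

def write_rows(rows, path="unowned_domains.csv"):
    """Each row dict holds starting_url, driver.current_url, driver.title,
    plus the joined (text, href) hits from the extraction sketch above."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```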

This csv file can jumpstart the process of adding entity records for domains that are currently not affiliated with any entity, and of identifying their privacy policies. The method of identifying privacy policies can also work for domains that are currently mapped to an owner, but that do not have a privacy policy listed.

Once I finish the 2xx/https domains, I'll probably process the 3xx/https domains - these are domains that might have been acquired, or are possibly up to no good, so I'll need to exercise additional caution about the device I use to gather information from them. Or, I might jump to the 4xx/https domains, as those could potentially be more legit sites (i.e., they use https, and they do not allow easy access to randos on the internet).

CAVEAT: This issue might be premature. The steps I'm outlining here are a WIP. Initial testing looks good, but it's not done until it's done, and because a lot of these sites are, at best, dodgy, they often behave in ways that are, well, curious, which makes data collection more interesting than I'd prefer.

Okay - the yield here looks pretty good. I ran just over 1K domains earlier today, and that in turn generated about 450 domains with a privacy policy and a distinct page title that are currently not tracked.

Some of the "positives" are actually parked domains - I don't know how many, but from eyeballing page titles there are some.

But yeah - this first test pass looks decent.

Okay - after some testing and work, this is definitely a viable approach (for me, anyways) for identifying owners of currently unclaimed domains.

More to come.

Okay - this is coming together.

This screencast shows some of the details: https://vimeo.com/680551075

But the short version: for domains that don't have owners, we can store ownership info in a csv.

That csv is processed, and the script:

  • checks for possible duplicates in privacy_policies.json and entity_map.json (the core check is sketched below)
  • for entries with no dupes, outputs json to be added to privacy_policies.json and entity_map.json
  • for entries with dupes, outputs a list of where the duplicates occurred
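At its core, the duplicate check is a pair of membership tests. A minimal sketch, assuming both files are keyed by entity/domain name (paths are placeholders; the full script is in the gist below):

```python
import json

def load(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

privacy_policies = load("privacy_policies.json")  # placeholder paths
entity_map = load("entity_map.json")

def find_dupes(name):
    """Return the files where `name` already appears."""
    dupes = []
    if name in privacy_policies:
        dupes.append("privacy_policies.json")
    if name in entity_map:
        dupes.append("entity_map.json")
    return dupes
```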

The script is here: https://gist.github.com/billfitzgerald/d8e5a1af729865f4b00b21eb9eeb980a

There are some additional details, but this is the short-ish version.

This issue can be closed - the process works, and leads to updates as documented in this pull request: #124