duckduckgo/tracker-radar

Change of ownership for w55c.net

billfitzgerald opened this issue · 3 comments

Not sure the best place to put this, so adding this as an issue. If there is a preferred way to flag what appears to be changes in ownership that affect the underlying data, please let me know.

w55c.net (documented in domains/US/w55c.net.json) used to be owned by dataxu.com - however, at present, dataxu.com (and the privacy policy at https://www.dataxu.com/about-us/privacy/data-collection-platform/) redirects to https://advertising.roku.com/

It looks like this acquisition happened in late 2019; I'm not sure when the redirects went into effect.

https://newsroom.roku.com/news/2019/11/roku-completes-dataxu-acquisition/xspfj-mz-1607554314#!

The calls to w55c.net appear to be unchanged (at least from the few I have observed), but the ownership needs to be updated.

dharb commented

@billfitzgerald thank you for flagging, it's much appreciated. Filing issues here for ownership updates is perfect.

I'll open a PR to merge the DataXu entity into the Roku entity.

billfitzgerald commented

Cool - if you show me the commits that are required, I can generate pull requests in the future. I'm assuming that it's an update to the Roku, Inc file, and an additional update to the w55c.net.json file to reflect the changed parent domain and privacy policy url?

I didn't want to generate a pull request that missed anything/got it wrong, but if what I'm describing sounds accurate I'd be glad to do that in the future.

And yeah - tracking churn across 20,000 domains is a pretty Sisyphean task :)

dharb commented

Hey @billfitzgerald, sorry to leave you hanging on this - I wanted to better explain things, but didn't have a chance until now.

As you discovered, we have entity files that contain a list of domains belonging to each company, and we also have domain files that contain data about what we saw each domain doing in our most recent crawl.

These files are actually generated from other data files, and updated on different cadences:

  • Entity files are updated each time we make a change to the 'source of truth' file, entity_map.json. That's why in the PR above you see that there were corresponding changes in both entity_map.json and Roku, Inc..json
  • Domain files are updated on each monthly crawl update using entity_map.json and privacy_policies.json. This is why you see that the PR above doesn't include changes to individual domain files - those files will be updated when we add the data from our monthly crawl.

Ownership changes - updating company names, merging companies together, adding missing domains, etc

I don't manually update the individual files for entity changes - I just make changes to entity_map.json, then run a script to propagate these changes. Technically you could make all of the relevant changes manually by following the changes in the PR above, but there's a much simpler way. It does require some basic git and node knowledge, so if you'd rather avoid that you can feel free to just open PRs including any changes to entity_map.json and privacy_policies.json and assign them to me, and I can take it from there. Anyway, here's how it works:

  • First, clone this repository and check out the main branch
  • Next, clone the tracker-radar-detector repository. I put them both in the same parent directory.
  • Now make any changes you'd like to the entity_map.json file in tracker-radar and save it. See the PR above for what these changes might look like. Note how I merged DataXu into Roku by removing DataXu, adding it as an alias of Roku, and moving its domains over to Roku.
  • Next, in the tracker-radar-detector directory you'll want to update the config.json file so trackerDataLoc points to the location of the tracker-radar project that you just cloned. You'll also want to run a quick npm i to install the required node modules. These steps only need to be done for the initial setup.
  • Next, run npm run apply-entity-changes from the tracker-radar-detector directory. This will propagate the changes you made to entity_map.json to the relevant files in the tracker-radar directory. There's a bit more documentation on that here
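
To make the merge step above concrete, here is a minimal Python sketch of what "removing DataXu, adding it as an alias of Roku, and moving its domains over to Roku" could look like as a data operation. It is only an illustration, not the script that tracker-radar-detector runs: merge_entity is a hypothetical helper, the field names aliases and properties are assumptions about the layout of entity_map.json, and the sample entries are made up for the example.

```python
import copy


def merge_entity(entity_map, source, target):
    """Return a copy of entity_map with the `source` entity folded into `target`.

    Assumes each entity is a dict with "aliases" and "properties" (domains)
    lists; these field names are an assumption, not confirmed by the thread.
    """
    merged = copy.deepcopy(entity_map)   # leave the caller's data untouched
    old = merged.pop(source)             # remove the acquired company
    new = merged[target]

    aliases = new.setdefault("aliases", [])
    domains = new.setdefault("properties", [])

    # The old company name (and any aliases it had) become aliases of the owner.
    for alias in [source, *old.get("aliases", [])]:
        if alias not in aliases:
            aliases.append(alias)

    # Move the old company's domains over to the new owner.
    for domain in old.get("properties", []):
        if domain not in domains:
            domains.append(domain)

    return merged


# Illustrative data only; the real file has many more entities and fields.
entity_map = {
    "Roku, Inc.": {"aliases": ["Roku"], "properties": ["roku.com"]},
    "DataXu": {"aliases": [], "properties": ["dataxu.com", "w55c.net"]},
}

updated = merge_entity(entity_map, "DataXu", "Roku, Inc.")
```

In practice the same edit can simply be made by hand in entity_map.json; either way, the change still has to be propagated with npm run apply-entity-changes as described in the last step.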

Privacy policy changes

This is much simpler - all that's required here is to edit the privacy_policies.json file, and that change will be propagated to domain files on the next crawl update. Feel free to add missing data there - we don't keep up with it as well as we should.

Please don't hesitate to reach out if you have any questions or run into issues. I can't tell you how much I appreciate your contributions.