datenanfragen/data

Record overlap: `rewe-shop`, `rewe-group-com`

mal-tee opened this issue · 7 comments

Both have "Rewe Markt GmbH" in their `runs` array. Is this a mistake we should resolve?

Thank you for opening this issue (based on my email).

Should we turn this into a test? @baltpeter

I haven't looked into that particular case yet. Are we sure that that is a mistake?

But, either way, we can't generally forbid two records from having identical `runs` entries. There are already valid records where that is the case, e.g. the Amazon records for different companies:

https://github.com/datenanfragen/data/blob/master/companies/amazon-de.json
https://github.com/datenanfragen/data/blob/master/companies/amazon-es.json

Haven't looked either. 😅

Yeah, we should only do that test if there is no overlap in the countries. 🤔

If there is overlap in the countries, you mean, right?

But even then, I'm not sure whether there can never be a case where that is valid…

Yes, oops.

I wrote a little script to implement this:

from collections import defaultdict
import json
import os

COMPANIES_DIR = "companies"

# Map every company name and every `runs` entry to the slugs of the records using it.
records_by_name = defaultdict(list)

for filename in os.listdir(COMPANIES_DIR):
    with open(os.path.join(COMPANIES_DIR, filename), "r") as f:
        company = json.load(f)
    slug = company["slug"]
    records_by_name[company["name"]].append(slug)
    for run in company.get("runs", []):
        records_by_name[run].append(slug)

# Names that occur more than once, regardless of country.
simple_overlap = {k: v for k, v in records_by_name.items() if len(v) > 1}
print("simple", len(simple_overlap))

for name, slugs in simple_overlap.items():
    slugs_by_country = defaultdict(list)
    alls = set()  # contains `name` if any of its records is relevant for all countries
    for slug in slugs:
        with open(os.path.join(COMPANIES_DIR, slug + ".json"), "r") as f:
            company = json.load(f)
        countries = company.get("relevant-countries", [])
        if countries == ["all"]:
            alls.add(name)
        else:
            for country in countries:
                slugs_by_country[country].append(slug)
    # Note: `> 2` only reports countries with three or more records for this name,
    # unless one of the records is relevant for all countries.
    filtered_overlap = {k: v for k, v in slugs_by_country.items() if len(v) > 2 or name in alls}
    if filtered_overlap:
        print(name, filtered_overlap, alls)

Output:

simple 38
REWE Markt GmbH {'de': ['rewe-shop']} {'REWE Markt GmbH'}
Ideawise Limited {'de': ['gay-de', 'fetisch-de', 'poppen-de', 'kaufmich-com']} set()
Seven.One Entertainment Group GmbH {'de': ['sat1gold', 'prosieben', 'kabeleinsdoku', 'kabeleins']} set()
cpx online active AG {'de': ['optivel'], 'ch': ['optivel'], 'fr': ['optivel'], 'at': ['optivel']} {'cpx online active AG'}
Ingenico Payment Services GmbH {'de': ['ingenico-de']} {'Ingenico Payment Services GmbH'}
Ingenico Healthcare GmbH {'de': ['ingenico-de']} {'Ingenico Healthcare GmbH'}

  1. The initial case for this issue. Seems legit, since the websites are different.
  2. The websites are different.
  3. Same.
  4. ...

Yeah, we'd also have to check if the websites are different. And probably every other key as well.
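
A rough sketch of what such a combined check could look like, working on records that are already loaded as dicts. It only flags a pair if the records share a name or `runs` entry, have overlapping countries, and also have the same website. The `web` key, the helper names `countries_overlap` and `suspicious_pairs`, and the rule that `"all"` overlaps with every country are assumptions made for illustration, not something settled in this thread; other keys would need the same treatment.

```python
from itertools import combinations


def countries_overlap(a, b):
    """Return True if two records share at least one relevant country."""
    countries_a = set(a.get("relevant-countries", []))
    countries_b = set(b.get("relevant-countries", []))
    # Assumption: "all" overlaps with any non-empty country list.
    if "all" in countries_a:
        return bool(countries_b)
    if "all" in countries_b:
        return bool(countries_a)
    return bool(countries_a & countries_b)


def suspicious_pairs(records):
    """Yield (slug_a, slug_b, shared_names) for pairs worth a manual look."""
    for a, b in combinations(records, 2):
        names_a = {a["name"], *a.get("runs", [])}
        names_b = {b["name"], *b.get("runs", [])}
        shared = names_a & names_b
        if not shared:
            continue
        if not countries_overlap(a, b):
            continue
        # Assumption: the website is stored under a "web" key. Records with
        # different websites are treated as legitimately separate; two records
        # that both lack the key compare equal and are therefore flagged.
        if a.get("web") != b.get("web"):
            continue
        yield a["slug"], b["slug"], sorted(shared)
```

A pair reported by this is only a candidate for manual review, since there may still be valid records that share all of these values.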

However, we can close this issue: the REWE group collision is okay, since the websites are different.

I see my original concern as unresolved. The database currently shows two responsible entities for REWE Markt GmbH:

  • REWE Markt GmbH
  • REWE Zentralfinanz eG

As I understand it, that cannot be right, because it is then ambiguous which of the two is actually responsible.
Which sources indicate that REWE Zentralfinanz eG is also responsible for REWE Markt GmbH? I have not been able to verify this so far.