ldbc/ldbc_snb_datagen_spark

People4Hops factor table does not guarantee paths between two persons

GLaDAP opened this issue · 2 comments

For SNB Interactive, for parameter selection the People4Hops factor table is used to get two persons with a path that is 4 hops long.
However, not all person pairs in the people4Hops have a path.

To reproduce:

  1. Using the generate-all.sh script from ldbc_snb_interactive_impls, generate SF1.
  2. Load SF1 into Neo4j using the implementation from the implementations repository
  3. Using Q13 with the following input: Person1Id=24189255812497, Person2Id=4398046513909 (this is from the people4Hops factor table), no path is found. The creation date/deletion date of both persons are outside the simulated time window (28-11-2012 ~31-12-2012)

After some investigation with the above example, it seems that the people4Hops does not take the creation dates of the knows edges into account. The path between 24189255812497 and 4398046513909 is 24189255812497;21990232566080;3903;6597069774812;4398046513909. However, the knows edges are created at later dates:

  • Edge between 24189255812497 and 21990232566080 created on December 17, 2012 9:19:22.307 PM (1355779162307)
  • Edge between 21990232566080 and 3903 created on November 29, 2012 3:22:30.993 AM (1354159350993)
  • Edge between 6597069774812 and 4398046513909 created on December 22, 2012 12:39:26.696 AM (1356136766696)

Thanks for reporting this! This is a pretty nasty bug but I have a workaround in mind.

This problem occurs because the factor generation step in the data generator is unaware of the benchmark's time window. So it has two choices:

  1. Only return person pairs between whom there is always a 4-hop path. The are two approaches for this:

    a. Only use persons and knows edges which always exist in the network. Doing this is simple but it has problems:

    • it will never return endpoints who have been freshly inserted to the network during the benchmark
    • there are likely going to be <4-hop paths between these people at some point during the network due to the addition of new persons/knows edges

    b. Take the temporal attributes into account: when calculating 2-hop paths by joining two knows edges k1 and k2, calculate the intersection of their interval, i.e. [max(k1.creationDate, k2.creationDate), min(k1.deletionDate, k2.deletionDate)]. This interval may very well be empty, in which case this is not an actual 2-hop path in the graph and should be discarded. (4-hop paths can then be found by taking two 2-hop paths and intersecting them in the middle.)

    This approach (1b) could work in theory but requires some work in the factorgen.

  2. Return pairs of persons between whom there is potentially no path in the given time interval (I call this "overapproximation", returning a set of pairs some of whom will not work for our parameters) and later filter on them.

    This approach has the advantages that: (1) we can do it in the paramgen, which is easier to develop & cheaper to iterate on than the factorgen; (2) we can be more granular for SNB Interactive: we can ensure that the two persons have an exactly 4-hop shortest path between them on a given day.

    Its disadvantage is that our "factors" now include the person_knows_person with attributes (person1id, person2id, day of creationDate, day of deletionDate), so it's hardly a "factor table" anymore. For SF10,000, the factors are 20GB in total and the personKnowsPersonDays table uses 15GB.

I actually implemented approach 2 for BI Q15b but didn't do so for BI Q15a. It's not trivial because doing 4 joins would definitely run out of memory on large SFs. I'll take a look and report back tomorrow.

So the solution for BI is relatively simple to implement: https://github.com/ldbc/ldbc_snb_bi/pull/104/files

The extra distinct is there as a naive implementation of factorization to limit the memory use of the long acyclic path.
Some benchmark results (for the full BI paramgen run): ldbc/ldbc_snb_bi#104 (comment)