opencypher/cypher-for-gremlin

Unable to filter out paths containing self edges using a cypher Match query.

pushkarnagpal opened this issue · 5 comments

Trying the following query to get a path using a match query to filter out paths that contain self edges (node with edges to itself to avoid increased length of path:

match p = (base:party{node_id:'245793137123'})<-[trail:ownership*..]-(leaves:party) where none( rel in relationships(p) where startNode(rel) = endNode(rel) ) return count(p)

But this query is unable to filter out the data and I get the entire set.

Hello @pushkarnagpal,

Are you sure you've provided a valid Cypher query?

It does not compile in Cypher for Gremlin and Neo4j gives error about multiple WHERE cases.

image

Hello @pushkarnagpal,

Are you sure you've provided a valid Cypher query?

It does not compile in Cypher for Gremlin and Neo4j gives error about multiple WHERE cases.

image

Hi @dwitry
I've updated the query now. (had accidentally written not instead of none in the query.

match p = (base:party{node_id:'245793137123'})<-[trail:ownership*..]-(leaves:party) where none( rel in relationships(p) where startNode(rel) = endNode(rel) ) return count(p)

Please look into the issue now. Thanks!

Variable length path in loops is a complex case, which is tricky to implement in Gremlin to fully cover all the edge cases. I've invested a lot of time trying to properly implement it (1, 2) but the solution still not universal. Unfortunately can not provide any estimates when this will be improved.

Variable length path in loops is a complex case, which is tricky to implement in Gremlin to fully cover all the edge cases. I've invested a lot of time trying to properly implement it (1, 2) but the solution still not universal. Unfortunately can not provide any estimates when this will be improved.

Ahh okay. I was working on a fraud detection case.and got stuck here. I saw this query conversion from the cypher in the gremlin-server.log file, and I see a .neq(' cypher.null') in the query and i assume this step is occuring at the none where clause.
I think this is the root cause of the bug.

g.V().as('base')
	.hasLabel('party').has('node_id', eq('245793137123'))
	.repeat(__
				.inE('ownership')
				.as('trail')
				.aggregate('  cypher.path.edge.p')
				.outV()
		)
	.emit()
	.until(__
			.path()
			.from('base')
			.count(local)
			.is(gte(21))
		)
	.as('leaves')
	.hasLabel('party')
	.path()
	.from('base')
	.as('p')
	.where(__
			.not(__.V().hasLabel('party').outE('ownership').inV().as('  GENERATED4').where(__.select('  GENERATED4').where(eq('leaves'))))
		)
	.optional(__
		.select(all, 'trail').as('trail')
		)
	.select('p')
	.is(neq('  cypher.null'))
	.count()
	.project('count(p)')
	.by(__.identity())

Hope I could help somehow with this. Thanks!

.neq(' cypher.null') is guard to handle cypher.null which represents null in Cypher for Gremlin.

We may call "root cause" the following snippet:

.repeat(__
            .inE('ownership')
            .as('trail')
            .aggregate('  cypher.path.edge.p')
            .outV()
    )
.emit()
.until(__
        .path()
        .from('base')
        .count(local)
        .is(gte(21))
    )

This is how variable length path <-[trail:ownership*..]- is translated. Basically, it is an imperative way to traverse relationships (21 is limit to avoid infinite loops). If there is a loop, the path will be traversed repeatedly (what happens in the query you've provided).

So we need to find a better implementation of this in Gremlin, but as I've said in the previous message this is a rather complicated task, as there are other edge cases that need to be considered, as well as performance.