proycon/codemetapy

graph creation of many entries fails with recursion depth error

broeder-j opened this issue · 4 comments

I am not sure whether this is related to the number of JSON-LD files or to their content; there are more than 1200 JSON-LD files.
Generating subgraphs of this data works, so I assume the quantity is the issue, also because it is a recursion error. However, I have not yet tried whether I can create a subgraph for really ALL of these files.

codemetapy --graph codemeta_results/git_*/*/*/codemeta_*.json > graph.json
...
  File "/home//work/git/codemetapy/codemeta/serializers/jsonld.py", line 182, in embed_items
    return embed_items(itemmap[data[idkey]], itemmap, copy(history))
  File "/usr/lib/python3.8/copy.py", line 72, in copy
    cls = type(x)
RecursionError: maximum recursion depth exceeded while calling a Python object

So the current graph serializer does not scale. I have seen this with different JSON-LD file sets: it fails after 2000 files or so, and the failure occurs at different files.

The default Python recursion limit is 1000.
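For reference, CPython's limit can be inspected and raised with the `sys` module, though raising it is only a stopgap here: a genuine cycle in the graph will exhaust any limit.

```python
import sys

# CPython's default recursion limit is 1000; deep JSON-LD embedding
# can hit it long before the OS stack is actually exhausted.
print(sys.getrecursionlimit())

# Raising the limit only postpones the error; it does not fix a cycle.
sys.setrecursionlimit(10_000)
```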

I have not looked into the code itself, but either the recursion has to be removed, or one could first serialize batches and combine them at the end, if combining JSON graph files scales.
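One way to remove the recursion, sketched here against a hypothetical, simplified `embed_items_iterative` (the real `embed_items` in codemeta/serializers/jsonld.py has a different signature and does more work), is to follow `@id` references with an explicit loop and a visited set, so a cycle terminates instead of overflowing the call stack:

```python
from copy import copy

def embed_items_iterative(data, itemmap, idkey="@id"):
    """Hypothetical sketch: resolve an item by following idkey
    references through itemmap with an explicit loop instead of
    recursion, tracking visited ids so a cycle simply stops the
    embedding rather than raising RecursionError."""
    seen = set()
    current = data
    while isinstance(current, dict) and current.get(idkey) in itemmap:
        item_id = current[idkey]
        if item_id in seen:
            # Cycle detected: stop embedding and return the reference as-is
            break
        seen.add(item_id)
        current = itemmap[item_id]
    return copy(current)
```

The same idea generalizes to nested items by pushing them onto an explicit worklist; the point is that cycle handling becomes an ordinary loop condition rather than a recursion guard.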

This does indeed seem to be a bug, but it should not be related to the number of files. It fails when trying to expand the JSON-LD representation because of some cycle in the graph (the code protects against that, so that is where I guess something is going wrong). I'd be interested in seeing exactly the file where it fails, and I wonder whether it can even be pinpointed to a single file.

There are some left-over debug statements in the code which you could enable to see where it fails: https://github.com/proycon/codemetapy/blob/master/codemeta/serializers/jsonld.py#L178 . If you send me the input files I can try to reproduce it.

The default Python recursion limit is 1000.

and I intend not to get anywhere near that ;) That'd be bad design.

one could first serialize batches and combine them at the end, if combining JSON graph files scales.

that could work yes
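A batch-and-merge approach could look roughly like this sketch (a hypothetical helper, not part of codemetapy): serialize subsets of the input files separately, then concatenate their `@graph` arrays and deduplicate nodes by `@id`:

```python
import json

def merge_graph_files(paths):
    """Hypothetical sketch: combine separately serialized JSON-LD
    graph files into one document by concatenating their @graph
    arrays and deduplicating nodes by @id."""
    nodes = {}       # nodes keyed by @id; first occurrence wins
    anonymous = []   # nodes without an @id are kept as-is
    context = None
    for path in paths:
        with open(path) as f:
            doc = json.load(f)
        context = context or doc.get("@context")
        for node in doc.get("@graph", []):
            node_id = node.get("@id")
            if node_id is None:
                anonymous.append(node)
            else:
                nodes.setdefault(node_id, node)
    return {"@context": context, "@graph": list(nodes.values()) + anonymous}
```

Note the limitation: "first occurrence wins" drops properties from later duplicates; a real merge would combine the properties of nodes sharing an `@id`.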

The files from the last printout:

Adding json-ld file from filex/codemeta_harvested.json to graph
    Found main resource with URI xx/snapshot

They do not fail if serialized to a graph alone or together with a few files, so it really depends on the collected history.
I will investigate this and let you know.

I have implemented some fixes (to be released in 2.4.0) that should hopefully prevent this bug, although it may still result in large serialisations, as codemetapy expands things quite eagerly.