ChenghaoMou/text-dedup

minhash_spark.py [UNABLE_TO_INFER_SCHEMA]

Closed this issue · 3 comments

When I run minhash_spark.py on a Spark cluster, I occasionally hit [UNABLE_TO_INFER_SCHEMA] errors, as shown in the screenshots below. I am not sure whether it is a data problem, since the workers need to copy data to different machines. Files that error out run fine after being retransmitted, but the error can reappear after a while. Could file movement or reading affect Spark? I have since set up an NFS server so that every worker reads identical files, yet the problem still occurs. Can you help me analyze where the problem lies?

[screenshots: Spark traceback showing the UNABLE_TO_INFER_SCHEMA error]

I found a workaround in the issue below; the error appears to come from Spark failing to read its own checkpoint files.

graphframes/graphframes#201

Changing the connectedComponents parameter to algorithm="graphx" made it work.
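For reference, here is a minimal PySpark sketch of that change. The toy vertex/edge DataFrames and the checkpoint path are illustrative, not the exact ones used in minhash_spark.py:

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("cc-graphx").getOrCreate()

# connectedComponents checkpoints intermediate results, so Spark needs a
# writable checkpoint directory before the call.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

# Illustrative toy graph; minhash_spark.py builds its own vertices/edges.
vertices = spark.createDataFrame([(0,), (1,), (2,)], ["id"])
edges = spark.createDataFrame([(0, 1)], ["src", "dst"])
g = GraphFrame(vertices, edges)

# algorithm="graphx" uses the older GraphX implementation instead of the
# default distributed iterative one ("graphframes"); it avoids the
# checkpoint-read failure but runs slower.
components = g.connectedComponents(algorithm="graphx")
```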

However, it takes longer than the default algorithm.

Thanks for sharing all the details. Could you verify that your checkpoint location is writable by Spark?
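One quick way to check is to checkpoint a trivial RDD; this fails fast if the workers cannot write to the location. A minimal sketch, where the path is a placeholder for whatever checkpoint directory the job actually uses:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path: substitute the checkpoint directory your job sets.
spark.sparkContext.setCheckpointDir("/mnt/nfs/spark-checkpoints")

# Mark a trivial RDD for checkpointing, then run an action to trigger
# the actual write; an unwritable location will raise an error here.
rdd = spark.sparkContext.parallelize(range(10))
rdd.checkpoint()
rdd.count()

print(spark.sparkContext.getCheckpointDir())
```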

Based on the conversation in the linked issue, there does not seem to be much I can do to "solve" it beyond checking write access to the checkpoint location. The default distributed, iterative algorithm is the whole reason I chose it in the first place, to speed things up.

I will add this issue to a Q&A section in case anyone encounters the same issue in the future.
