rajasekarv/vega

remove serialization of duplicate data in dependencies along with task

rajasekarv opened this issue · 3 comments

Hi, when I ran a sample called 'Transitive closure on a graph' (the typical Spark example, https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkTC.scala), I found that the total number of serialized bytes grew too fast for the job to run to completion: only two or three iterations exhaust my memory. The problem seems related to this issue. If I want to contribute a fix, what are the main problems to solve, and could you please give me some hints?

Hi, I've finished it. Thanks.

Hello @AmbitionXiang

Hope you are doing well. Thanks for checking it out and raising the issue. Yes, due to data duplication during serialization, the process can run out of memory very quickly if the data flow branches out a lot. It is a long-pending issue, and since I have been busy with personal work, I never got time to address it.
I plan to resume work on the project in about a month, and I will be managing it actively this time. If you have already done some work, please raise a Pull Request and I will merge it after reviewing it.
Thanks a lot for your support.