[PYTHON] to_pandas_series_rdd() does not work
Opened this issue · 0 comments
When I try to use the method to_pandas_series_rdd
of any TimeSeriesRDD, I get this error.
PicklingError: Could not serialize object: Py4JError: An error occurred while calling o231.getnewargs. Trace:
py4j.Py4JException: Method getnewargs([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
I don't know if anyone else have the same problem but after searching about this error, I realized that the error was caused because of the index().to_pandas_index()
, the lazy evaluation makes a spark expression been called in the map
, so I rewrote the method like this using deep copy and I don't get the error anymore.
def to_pandas_series_rdd(self):
""" Returns an RDD of Pandas Series objects indexed with Pandas DatetimeIndexes """
def get_index(self):
index = self.index().to_pandas_index()
indexDC = index.copy(name="index", deep=True)
return indexDC
index = get_index(self)
return self.map(lambda x: (x[0], pd.Series(x[1], index) ) )