BUG: service stopped when pivoting a 1125138913x5 DataFrame into 4000 columns on a 160-CPU machine with 4096 GB of memory
Closed this issue · 1 comment
huboqiang commented
Describe the bug
Running pivot on a large DataFrame stopped the service:
To Reproduce
To help us reproduce this bug, please provide the information below:
- Your Python version: 3.11.4
- The version of Xorbits you use: 0.5.1
- Versions of crucial packages: numpy (1.25.2), scipy (1.11.2), pandas (2.0.3)
- Full stack of the error.
……
……
……
/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/dataframe/base/pivot.py:72: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
input_data[dtype] = np.nan
2023-10-13 19:20:03,359 xorbits._mars.services.scheduling.worker.execution 337003 ERROR Failed to run subtask n6Qo8tTl8A6Pn7uCwJHxGAEK on band numa-0
Traceback (most recent call last):
File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 445, in _run_subtask_once
return await asyncio.shield(aiotask)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/subtask/api.py", line 70, in run_subtask_in_slot
return await ref.run_subtask.options(profiling_context=profiling_context).send(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/context.py", line 226, in send
result = await self._wait(future, actor_ref.address, send_message) # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/context.py", line 115, in _wait
return await future
^^^^^^^^^^^^
File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/core.py", line 84, in _listen
raise ServerClosed(
xoscar.errors.ServerClosed: Remote server unixsocket:///361854215913472 closed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 402, in internal_run_subtask
subtask_info.result = await self._retry_run_subtask(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 513, in _retry_run_subtask
return await _retry_run(subtask, subtask_info, _run_subtask_once)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 95, in _retry_run
raise ex
File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 73, in _retry_run
return await target_async_func(*args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 491, in _run_subtask_once
raise ex
xoscar.errors.ServerClosed: unexpectedly terminated process (Remote server unixsocket:///361854215913472 closed) with address 127.0.0.1:39737, which is highly suspected to be caused by an Out-of-Memory (OOM) problem
2023-10-13 19:20:03,454 xorbits._mars.services.task.execution.mars.stage 337003 ERROR Subtask n6Qo8tTl8A6Pn7uCwJHxGAEK errored
Traceback (most recent call last):
File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 445, in _run_subtask_once
return await asyncio.shield(aiotask)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/subtask/api.py", line 70, in run_subtask_in_slot
return await ref.run_subtask.options(profiling_context=profiling_context).send(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/context.py", line 226, in send
result = await self._wait(future, actor_ref.address, send_message) # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/context.py", line 115, in _wait
return await future
^^^^^^^^^^^^
File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/core.py", line 84, in _listen
raise ServerClosed(
xoscar.errors.ServerClosed: Remote server unixsocket:///361854215913472 closed
……
……
……
- Minimized code to reproduce the error.
df = xpd.read_parquet("./merged.pq")
print(df.shape) # (1125138913, 5)
df_pvt = df.pivot(index=[0,1,2], columns=[4], values=[3])
df_pvt
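For context, a rough back-of-the-envelope estimate of the dense pivot result's footprint, assuming float64 values and taking the worst case where every (index) combination is unique (the real number of unique combinations in merged.pq is unknown):

```python
# Hypothetical worst-case memory estimate for the pivoted result.
# Assumption: one output row per input row (every index combination unique);
# the actual number of unique index combinations is not known here.
n_rows_input = 1_125_138_913
n_pivot_columns = 4_000
bytes_per_value = 8  # float64

worst_case_bytes = n_rows_input * n_pivot_columns * bytes_per_value
print(f"{worst_case_bytes / 1024**4:.0f} TiB")  # far beyond 4096 GB of RAM
```

Even if the real number of unique index combinations is much smaller, a dense pivot at this scale can plausibly exceed the 4096 GB of available memory, which is consistent with the OOM suspicion in the log.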
Expected behavior
I think there is a bug when handling data at this scale: the worker process appears to be killed (likely OOM) and the server closes.
Additional context
No.
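For reference, the PerformanceWarning in the log suggests building all new columns at once with pd.concat(axis=1) instead of inserting them one by one. A minimal sketch of the two patterns on toy data (not the Xorbits internals):

```python
import numpy as np
import pandas as pd

base = pd.DataFrame({"a": [1, 2, 3]})

# Fragmenting pattern: one insert per new column (triggers the warning
# when repeated many times).
frag = base.copy()
for name in ["x", "y", "z"]:
    frag[name] = np.nan

# De-fragmented pattern: build all new columns, then concat once.
new_cols = pd.DataFrame(np.nan, index=base.index, columns=["x", "y", "z"])
defrag = pd.concat([base, new_cols], axis=1)

assert frag.equals(defrag)  # same result, fewer internal blocks
```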