Ray error when running the Xinghe Cup private information retrieval (PIR) example code
Opened this issue · 31 comments
Issue Type
Bug
Source
binary
Secretflow Version
secretflow 1.0.0b3
OS Platform and Distribution
Asianux
Python version
3.8.16
Bazel version
No response
GCC/Compiler version
No response
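What happened and What you expected to happen.
Running the Xinghe Cup PIR code fails (data size: 10 million rows). The message says the node was running low on memory and the worker process was killed.
Environment: 4 cores, 8 GB RAM
swap_space: 20G
How should the Ray cluster be configured to handle this workload?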
Reproduction code to reproduce the issue.
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: SPURuntime
The actor is dead because its worker process has died. Worker exit type: NODE_OUT_OF_MEMORY Worker exit detail: Task was killed due to the node running low on memory.
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html.
Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task.
Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM.
To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray.
To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
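The last two lines of the Ray message name environment variables that Ray reads when it starts. The following is a minimal sketch, assuming a single-node setup where Ray is launched from the same Python process; the helper name `ray_memory_monitor_env` and the value 0.98 are illustrative only. Raising the threshold does not add memory, it only delays the point at which the worker is killed.

```python
import os


def ray_memory_monitor_env(usage_threshold=None, disable_worker_killing=False):
    """Build the environment variables named in the Ray OOM message."""
    env = {}
    if usage_threshold is not None:
        if not 0.0 < usage_threshold <= 1.0:
            raise ValueError("usage_threshold must be in (0, 1]")
        # Fraction of node memory in use at which Ray starts killing workers.
        env["RAY_memory_usage_threshold"] = str(usage_threshold)
    if disable_worker_killing:
        # A refresh interval of zero turns the memory monitor off.
        env["RAY_memory_monitor_refresh_ms"] = "0"
    return env


# The variables must be set before Ray starts, i.e. before sf.init() is called.
os.environ.update(ray_memory_monitor_env(usage_threshold=0.98))
```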
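@integrationex01 Hello, our recommended minimum CPU/memory configuration is 8C16G; we suggest using a machine that meets it.
For now, you can try again after freeing up disk space and reducing the data size.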
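What I mainly want to know is whether SecretFlow only lets me configure the Ray cluster through the few parameters of sf.init(), or whether I can operate on the Ray cluster directly.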
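You can follow the cluster simulation mode: start a Ray cluster first, then have secretflow connect to the running cluster.
In addition, most parameters of ray.init() can be passed in through sf.init(): https://github.com/secretflow/secretflow/blob/main/secretflow/device/driver.py#L519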
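As a rough illustration of the two options in this reply, the sketch below only assembles keyword arguments for `sf.init()`; it does not import secretflow. The helper name `build_sf_init_kwargs` and the address `192.168.0.10:6379` are hypothetical, and `object_store_memory` is used as an example of a `ray.init()` parameter that would be forwarded. It assumes that resources are not passed when connecting to a cluster that is already running, since they are fixed when that cluster is started (for example with `ray start --head`).

```python
def build_sf_init_kwargs(ray_address=None, num_cpus=None, object_store_memory=None):
    """Collect keyword arguments for sf.init(); extras are forwarded to ray.init()."""
    kwargs = {"parties": ["alice", "bob"], "log_to_driver": True}
    if ray_address is not None:
        # Connect to an existing cluster; its resources were set at `ray start`.
        kwargs["address"] = ray_address
    else:
        # Let sf.init() start a local Ray instance with explicit limits.
        kwargs["address"] = "local"
        if num_cpus is not None:
            kwargs["num_cpus"] = num_cpus
        if object_store_memory is not None:
            kwargs["object_store_memory"] = object_store_memory  # bytes
    return kwargs


local_kwargs = build_sf_init_kwargs(num_cpus=8, object_store_memory=4 * 1024**3)
cluster_kwargs = build_sf_init_kwargs(ray_address="192.168.0.10:6379")  # hypothetical address
print(local_kwargs)
print(cluster_kwargs)
# With secretflow installed, the call would then be: sf.init(**local_kwargs)
```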
After changing the number of cores and the memory size, I get the following error.
Current environment:
16C64G
(SPURuntime(device_id=None, party=alice) pid=5496) 2024-04-28 14:51:36.840 [error] [global.cpp:BRPC:306] external/com_google_protobuf/src/google/protobuf/message_lite.cc:480
spu.psi.proto.QueryResponseProto exceeded maximum protobuf size of 2GB: 2502202380
(SPURuntime(device_id=None, party=bob) pid=5500) 2024-04-28 14:51:40.695 [critical] [global.cpp:BRPC:309] external/com_google_protobuf/src/google/protobuf/stubs/stringpiece.cc:50
size too big: 18446744071916786700 details: string length exceeds max size
Traceback (most recent call last):
  File "MilkRiverCup.py", line 158, in <module>
    app.run(main)
  File "/app/bspdev/anaconda3/envs/secretflow/lib/python3.8/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/app/bspdev/anaconda3/envs/secretflow/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "MilkRiverCup.py", line 147, in main
    reports = spu.pir_query(
  File "/app/bspdev/anaconda3/envs/secretflow/lib/python3.8/site-packages/secretflow/device/device/spu.py", line 1950, in pir_query
    return dispatch(
  File "/app/bspdev/anaconda3/envs/secretflow/lib/python3.8/site-packages/secretflow/device/device/register.py", line 111, in dispatch
    return _registrar.dispatch(self.device_type, name, self, *args, **kwargs)
  File "/app/bspdev/anaconda3/envs/secretflow/lib/python3.8/site-packages/secretflow/device/device/register.py", line 80, in dispatch
    return self._ops[device_type][name](*args, **kwargs)
  File "/app/bspdev/anaconda3/envs/secretflow/lib/python3.8/site-packages/secretflow/device/kernels/spu.py", line 522, in pir_query
    return sfd.get(res)
  File "/app/bspdev/anaconda3/envs/secretflow/lib/python3.8/site-packages/secretflow/distributed/primitive.py", line 75, in get
    return ray.get(object_refs)
  File "/app/bspdev/anaconda3/envs/secretflow/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/app/bspdev/anaconda3/envs/secretflow/lib/python3.8/site-packages/ray/_private/worker.py", line 2309, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::SPURuntime.pir_query() (pid=5500, repr=SPURuntime(device_id=None, party=bob))
  File "/app/bspdev/anaconda3/envs/secretflow/lib/python3.8/site-packages/secretflow/device/device/spu.py", line 1181, in pir_query
    report = pir.pir_client(self.link, config)
  File "/app/bspdev/anaconda3/envs/secretflow/lib/python3.8/site-packages/spu/pir.py", line 56, in pir_client
    report_str = libspu.libs.pir_client(link, config.SerializeToString())
RuntimeError: size too big: 18446744071916786700 details: string length exceeds max size
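In a WSL environment:
8C8G environment
Query time: 2976.5 s
Data preprocessing time: 6200.5 s
Preprocessed data size: 4.7 GB
Ran successfully
How should this error be handled?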
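The two sizes in the log appear to be the same number seen through different integer types: 2502202380 exceeds the protobuf limit of 2^31 - 1 bytes, and if it is squeezed into a signed 32-bit integer and then widened to an unsigned 64-bit size, the result is exactly the 18446744071916786700 reported on bob's side. This is an arithmetic observation from the log, not a confirmed description of where the conversion happens inside the library.

```python
PROTOBUF_MAX_BYTES = 2**31 - 1  # the "2GB" limit named in the log


def as_int32(n):
    """Reinterpret the low 32 bits of n as a signed 32-bit integer."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n


def as_uint64(n):
    """Reinterpret a (possibly negative) integer as an unsigned 64-bit value."""
    return n & 0xFFFFFFFFFFFFFFFF


response_bytes = 2502202380         # QueryResponseProto size from alice's log line
wrapped = as_int32(response_bytes)  # negative once it no longer fits in 31 bits
reported = as_uint64(wrapped)       # the value a 64-bit size type would then show

print(response_bytes > PROTOBUF_MAX_BYTES)  # True
print(wrapped)                              # -1792764916
print(reported)                             # 18446744071916786700
```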
Could you please post the code you are running?
The code is as follows:
import os
import sys
import time
import logging
import multiprocessing
from absl import app
import spu
import secretflow as sf
import pandas as pd
from pathlib import Path
#import random

# init log
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

# SPU settings
# alice as pir server
# bob as pir client
cluster_def = {
    'nodes': [
        {
            'party': 'alice',
            'id': 'local:0',
            'address': f'127.0.0.1:17268',
            # 'tls_opts': {
            #     'server_ssl_opts': {
            #         'certificate_path': 'alice servercert.pem',
            #         'private_key_path': 'alice serverkey.pem',
            #         # The options used for verify peer's client certificate
            #         'ca_file_path': 'cacert.pem',
            #         # Maximum depth of the certificate chain for verification
            #         'verify_depth': 1
            #     },
            #     'client_ssl_opts': {
            #         'certificate_path': 'alice clientcert.pem',
            #         'private_key_path': 'alice clientkey.pem',
            #         # The options used for verify peer's server certificate
            #         'ca_file_path': 'cacert.pem',
            #         # Maximum depth of the certificate chain for verification
            #         'verify_depth': 1
            #     }
            # }
        },
        {
            'party': 'bob',
            'id': 'local:1',
            'address': f'127.0.0.1:17269',
            # 'tls_opts': {
            #     'server_ssl_opts': {
            #         'certificate_path': 'bob servercert.pem',
            #         'private_key_path': 'bob serverkey.pem',
            #         # The options used for verify peer's client certificate
            #         'ca_file_path': 'cacert.pem',
            #         # Maximum depth of the certificate chain for verification
            #         'verify_depth': 1
            #     },
            #     'client_ssl_opts': {
            #         'certificate_path': 'bob clientcert.pem',
            #         'private_key_path': 'bob clientkey.pem',
            #         # The options used for verify peer's server certificate
            #         'ca_file_path': 'cacert.pem',
            #         # Maximum depth of the certificate chain for verification
            #         'verify_depth': 1
            #     }
            # }
        },
    ],
    'runtime_config': {
        'protocol': spu.spu_pb2.SEMI2K,
        'field': spu.spu_pb2.FM128,
    },
}

link_desc = {
    'recv_timeout_ms': 3600000,
}


def main(_):
    # sf init
    sf.init(['alice', 'bob'], address='local', log_to_driver=True, omp_num_threads=multiprocessing.cpu_count())
    # init log
    logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
    alice = sf.PYU('alice')
    bob = sf.PYU('bob')
    key_columns = ['name']
    label_columns = ['age']
    # Note: this local name shadows the imported `spu` module inside main()
    spu = sf.SPU(cluster_def, link_desc)
    pir_input_path = f"{str(Path.home())}/pythonpath/alice_pir_input_1kw.csv"
    pir_oprf_key_path = f"{str(Path.home())}/pythonpath/server_secret_key.bin"
    pir_setup_path = f"{str(Path.home())}/pythonpath/alice_1kw_setup"
    bob_df = pd.DataFrame({
        # "name": ["tony", "bob"],
        "name": ["李建华"],
    })
    bob_df.to_csv(f"{str(Path.home())}/pythonpath/bob_pir_query_1kw.csv", index=False)
    bob_input_path = f"{str(Path.home())}/pythonpath/bob_pir_query_1kw.csv"
    start = time.time()
    # server setup
    reports = spu.pir_setup(
        server='alice',
        input_path=pir_input_path,
        key_columns=key_columns,
        label_columns=label_columns,
        oprf_key_path=pir_oprf_key_path,
        setup_path=pir_setup_path,
        num_per_query=1,
        label_max_len=18,
    )
    #print(f"psi reports: {reports}")
    logging.info(f"offline psi reports: {reports}")
    logging.info(f"cost time: {time.time() - start}")
    # client config
    bob_config = {
        'input_path': bob_input_path,
        'key_columns': key_columns,
        'output_path': f"{str(Path.home())}/pythonpath/bob_pir_result_1kw.csv",
    }
    # server config
    alice_config = {
        'oprf_key_path': pir_oprf_key_path,
        'setup_path': pir_setup_path,
    }
    query_config = {
        alice: alice_config,
        bob: bob_config,
    }
    start = time.time()
    reports = spu.pir_query(
        server='alice',
        config=query_config,
    )
    logging.info(f"online pir reports: {reports}")
    logging.info(f"cost time: {time.time() - start}")
    sf.shutdown()


if __name__ == '__main__':
    app.run(main)
RuntimeError: size too big: 18446744071916786700 details: string length exceeds max size
Try adjusting the "label_max_len=18" parameter to match the actual label length or larger, then run it again.
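@Chrisdehe Hello, is there a rule to follow when setting this parameter, i.e. how it relates to the amount of data in setup? Or, put differently, how the setup data size relates to memory?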
label_max_len (int): Max number bytes of label, padding data to label_max_len Max label bytes length add 4 bytes(len).
I saw this user guide:
https://github.com/secretflow/psi/blob/c80ac38fd8e0df9860001d46bc064153c9e203c6/docs/user_guide/pir.rst#L157 which uses label_max_len=288 for data at the million-record scale.
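One reading of the docstring above is that `label_max_len` should cover the longest label in bytes plus a 4-byte length field. The sketch below measures the longest label in a CSV under that reading. The helper name `max_label_bytes` is hypothetical, and joining multiple label columns with commas is an assumption, not something the docstring states. It says nothing about how the value interacts with the total data size, which is the open question in this thread.

```python
import csv
import io

LENGTH_PREFIX_BYTES = 4  # from the docstring: "add 4 bytes(len)"


def max_label_bytes(csv_file, label_columns):
    """Return the longest label, in UTF-8 bytes, found in an open CSV file."""
    longest = 0
    for row in csv.DictReader(csv_file):
        # Assumption: multiple label columns are joined with commas.
        label = ",".join(row[c] for c in label_columns)
        longest = max(longest, len(label.encode("utf-8")))
    return longest


# Tiny in-memory stand-in for the server input file.
sample = io.StringIO("name,age\nalice,23\nbob,101\n")
needed = max_label_bytes(sample, ["age"]) + LENGTH_PREFIX_BYTES
print(needed)  # 7: the longest label "101" is 3 bytes, plus the 4-byte length
```

For a real file, `open(path, newline="", encoding="utf-8")` can be passed in place of the `StringIO` object.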
Is your client trying to query 10 million records?
@6fj
Yes: Alice sets up 10 million records, and Bob queries for a single name.
bob_pir_query_1kw.csv
How many queries does this file contain? Just one?
@6fj
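Yes, just one. It is a single query against the 10-million-record scenario.
The problem we see so far is that a brpc message exceeded the serialization limit along the way. See whether label_max_len can be made smaller. It cannot be too small either, but if it is too small to hold the label you will get an explicit error.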
@6fj
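If it is convenient, could you share a link to the relevant code?
Yes, that is exactly the problem I am hitting (too large exceeds the limit, too small does not fit). I debugged it in the test environment and estimate that the workable value is around 35-38, but I ran into a case where no integer value meets the requirement, so I would like to know how label_max_len relates to the data size. The machine I am using now is 16C64G; would adding memory solve the problem?
The problem you have hit probably cannot be solved with the existing configuration options; we need to optimize the code or expose more BFV parameters. Could you anonymize your data and upload it?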
How should I send it to you? It is 39 MB compressed (input + query + the Python file that generates the data).
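@6fj GitHub limits file uploads to 25 MB, so another method is needed. Also, the name "李建华" used to generate the query file in the code above may not exist in alice_pir_input_1kw.csv, so please preview that file in Excel first.
@integrationex01 You can send the files by email (wdh01581486@antgroup.com) or WeChat (technical support: secretflow02).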
@Chrisdehe
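I have sent an email to that address.
@6fj @Chrisdehe Hello, is there a solution to this problem yet?
We did find a bug and will fix it right away. At the same time, your data also contains duplicate keys, which you will need to handle. Thanks.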
@6fj
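1. Could you share the details of the bug?
2. Does "duplicate keys" mean people with the same name, or two entries whose fields are all identical? Duplicate names can happen; does querying by name alone cause a problem? With small data sizes I saw that all results with the same name were returned.
- A data-size check was written incorrectly.
- The keys are duplicated; this has nothing to do with the payload. Keyword PIR cannot allow duplicate keys.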
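Since keyword PIR requires unique keys, the server input can be checked before `pir_setup` is run. The sketch below counts key tuples with the standard library; the helper name `duplicate_keys` is hypothetical and the sample data is made up for illustration.

```python
import csv
import io
from collections import Counter


def duplicate_keys(csv_file, key_columns):
    """Return {key_tuple: count} for every key that occurs more than once."""
    counts = Counter(
        tuple(row[c] for c in key_columns) for row in csv.DictReader(csv_file)
    )
    return {key: n for key, n in counts.items() if n > 1}


# Two rows share the key "tony" even though their payloads differ.
sample = io.StringIO("name,age\ntony,30\nbob,41\ntony,52\n")
print(duplicate_keys(sample, ["name"]))  # {('tony',): 2}
```

An empty result means the chosen key columns are unique across the file.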
https://github.com/secretflow/psi/blob/c80ac38fd8e0df9860001d46bc064153c9e203c6/examples/pir/generate_pir_data.cc
I tried using the C++ code in the example to generate test data and got the following error:
bazel run //examples/pir:generate_pir_data -c opt -- -data_count 10000 -label_len 32 -server_out_path /tmp/pir_server.csv -client_out_path /tmp/pir_client.csv
INFO: Analyzed target //examples/pir:generate_pir_data (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
ERROR: /home/chenhengji/.cache/bazel/_bazel_chenhengji/b4d87c2493d0f3b6fda6d6d62517e99c/external/yacl/yacl/crypto/rand/BUILD.bazel:19:16: Compiling yacl/crypto/rand/rand.cc failed: (Exit 1): gcc failed: error executing command (from target @yacl//yacl/crypto/rand:rand) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 62 arguments skipped)
Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
In file included from external/yacl/yacl/crypto/rand/rand.h:25,
from external/yacl/yacl/crypto/rand/rand.cc:15:
external/yacl/yacl/base/dynamic_bitset.h: In instantiation of 'constexpr yacl::dynamic_bitset<Block, Allocator>::dynamic_bitset(yacl::dynamic_bitset<Block, Allocator>::size_type, uint128_t, const allocator_type&) [with Block = __int128 unsigned; Allocator = std::allocator<__int128 unsigned>; yacl::dynamic_bitset<Block, Allocator>::size_type = long unsigned int; uint128_t = __int128 unsigned; yacl::dynamic_bitset<Block, Allocator>::allocator_type = std::allocator<__int128 unsigned>]':
external/yacl/yacl/crypto/rand/rand.cc:105:1: required from here
external/yacl/yacl/base/dynamic_bitset.h:2296:20: error: division by zero is not a constant expression
2296 | constexpr size_t init_val_required_blocks = u128_bits_number / bits_per_block;
| ^~~~~~~~~~~~~~~~~~~~~~~~
Target //examples/pir:generate_pir_data failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 140.731s, Critical Path: 130.96s
INFO: 121 processes: 8 internal, 113 linux-sandbox.
FAILED: Build did NOT complete successfully
ERROR: Build failed. Not running target
Hi, which gcc version are you using? We suggest trying again with 11.4.
@lq0404510 Hello, I have upgraded gcc to 11.4, but the bazel build still fails as follows:
WARNING: Download from https://golang.org/dl/?mode=json&include=all failed: class java.io.IOException connect timed out
INFO: Analyzed target //examples/pir:generate_pir_data (82 packages loaded, 6802 targets configured).
INFO: Found 1 target...
ERROR: /home/chenhengji/cppPath/secretflow/psi/examples/pir/BUILD.bazel:19:14: Linking examples/pir/generate_pir_data failed: (Exit 1): gcc failed: error executing command (from target //examples/pir:generate_pir_data) /usr/bin/gcc @bazel-out/k8-opt/bin/examples/pir/generate_pir_data-2.params
Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
bazel-out/k8-opt/bin/external/com_github_openssl_openssl/openssl/lib/libcrypto.a(libcrypto-lib-dso_dlfcn.o):dso_dlfcn.c:function dlfcn_globallookup: error: undefined reference to 'dlopen'
bazel-out/k8-opt/bin/external/com_github_openssl_openssl/openssl/lib/libcrypto.a(libcrypto-lib-dso_dlfcn.o):dso_dlfcn.c:function dlfcn_globallookup: error: undefined reference to 'dlsym'
bazel-out/k8-opt/bin/external/com_github_openssl_openssl/openssl/lib/libcrypto.a(libcrypto-lib-dso_dlfcn.o):dso_dlfcn.c:function dlfcn_globallookup: error: undefined reference to 'dlclose'
bazel-out/k8-opt/bin/external/com_github_openssl_openssl/openssl/lib/libcrypto.a(libcrypto-lib-dso_dlfcn.o):dso_dlfcn.c:function dlfcn_pathbyaddr: error: undefined reference to 'dladdr'
bazel-out/k8-opt/bin/external/com_github_openssl_openssl/openssl/lib/libcrypto.a(libcrypto-lib-dso_dlfcn.o):dso_dlfcn.c:function dlfcn_pathbyaddr: error: undefined reference to 'dlerror'
bazel-out/k8-opt/bin/external/com_github_openssl_openssl/openssl/lib/libcrypto.a(libcrypto-lib-dso_dlfcn.o):dso_dlfcn.c:function dlfcn_bind_func: error: undefined reference to 'dlsym'
bazel-out/k8-opt/bin/external/com_github_openssl_openssl/openssl/lib/libcrypto.a(libcrypto-lib-dso_dlfcn.o):dso_dlfcn.c:function dlfcn_bind_func: error: undefined reference to 'dlerror'
bazel-out/k8-opt/bin/external/com_github_openssl_openssl/openssl/lib/libcrypto.a(libcrypto-lib-dso_dlfcn.o):dso_dlfcn.c:function dlfcn_load: error: undefined reference to 'dlopen'
bazel-out/k8-opt/bin/external/com_github_openssl_openssl/openssl/lib/libcrypto.a(libcrypto-lib-dso_dlfcn.o):dso_dlfcn.c:function dlfcn_load: error: undefined reference to 'dlclose'
bazel-out/k8-opt/bin/external/com_github_openssl_openssl/openssl/lib/libcrypto.a(libcrypto-lib-dso_dlfcn.o):dso_dlfcn.c:function dlfcn_load: error: undefined reference to 'dlerror'
bazel-out/k8-opt/bin/external/com_github_openssl_openssl/openssl/lib/libcrypto.a(libcrypto-lib-dso_dlfcn.o):dso_dlfcn.c:function dlfcn_unload: error: undefined reference to 'dlclose'
collect2: error: ld returned 1 exit status
Target //examples/pir:generate_pir_data failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 339.269s, Critical Path: 124.49s
INFO: 157 processes: 23 internal, 134 linux-sandbox.
FAILED: Build did NOT complete successfully
ERROR: Build failed. Not running target
Hi! You can first add this line to the .bazelrc file in the psi directory: build --linkopt=-ldl, then try again.
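@6fj @Chrisdehe Hello, I modified the code to use Index and name together as the key for queries. The code that generates the test data and the code that runs the PIR query have been sent to that email address.
I have not found the corresponding commit yet. Has the fix for this issue been committed? If so, has it been integrated into secretflow?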
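The workaround described here, using an index together with the name as the key, can be sketched as follows. The helper name `add_index_column` and the column name `index` are hypothetical, and this is not the code that was emailed. One consequence of this design is that the client must also know the index of the row it wants, since the query key has to match both columns.

```python
import csv
import io


def add_index_column(src, dst, index_name="index"):
    """Copy a CSV, prepending a running row number so (index, name) is unique."""
    reader = csv.DictReader(src)
    fieldnames = [index_name] + list(reader.fieldnames)
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for i, row in enumerate(reader):
        row[index_name] = i
        writer.writerow(row)


# Two people share a name; the added index tells them apart.
src = io.StringIO("name,age\ntony,30\ntony,52\n")
dst = io.StringIO()
add_index_column(src, dst)
print(dst.getvalue())

key_columns = ["index", "name"]  # would replace key_columns = ['name']
```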
@lq0404510 Thanks, it worked.
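Hi, two questions about the generate_pir_data compile error: 1. Which gcc version were you using when it occurred? 2. How did you install gcc on both occasions (yum, conda, or some other way)?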
hi @integrationex01
Hi, you can upgrade the Python in your conda environment to Python 3.10 and then install with pip install secretflow==1.6.1b0