When Client and Server items are in some certain sizes, server.CreateSetupMessage raise ValueError: basic_string::resize

Question

When Client and Server items are in some certain sizes, server.CreateSetupMessage raise ValueError: basic_string::resize

chesterxgchen opened this issue 2 years ago · 13 comments

Description

When client items size and server items size are in certain values, the PSI code throw basic_string::resize Value Error.
For example, using the tests.py https://github.com/OpenMined/PSI/blob/master/private_set_intersection/python/tests.py#L28

if we change the client items and server items range into the following values:
client_items = ["Element " + str(i) for i in range(333334)]
server_items = ["Element " + str(2 * i) for i in range(397)]

You should be able to get the exception

File ~/nvflare-env2/lib/python3.8/site-packages/openmined_psi/__init__.py:140, in server.CreateSetupMessage(self, fpr, num_client_inputs, inputs)
    129 def CreateSetupMessage(
    130     self, fpr: float, num_client_inputs: int, inputs: List[str]
    131 ) -> ServerSetup:
    132     """Create a setup message from the server's dataset to be sent to the client.
    133     Args:
    134         fpr: the probability that any query of size `num_client_inputs` will result in a false positive.
   (...)
    138         A Protobuf with the setup message.
    139     """
--> 140     interm_msg = self.data.CreateSetupMessage(fpr, num_client_inputs, inputs).save()
    141     msg = ServerSetup()
    142     msg.ParseFromString(interm_msg)

ValueError: basic_string::resize

How to Reproduce

check the code in https://github.com/chesterxgchen/psi_mpc_related/blob/main/psi/psi_intersect.ipynb
command Cell No. 5 should re-produce the error.
The code in this notebook is essentially the same as https://github.com/OpenMined/PSI/blob/master/private_set_intersection/python/tests.py#L28
pip install the openmined.psi
start the jupyter notebook with https://github.com/chesterxgchen/psi_mpc_related/blob/main/psi/psi_intersect.ipynb
execute the import the psi packages, and def dup() function, then run cmd 5, to see error

Expected Behavior

No error

Screenshots

System Information

OS: Ubuntu
OS Version: 20.04
Openmined PSI version: openmined.psi-0.3.5 ( directly downloaded from Pipy)
Language Version: python 3.8.16 (default, Dec 7 2022, 01:12:06)
Package Manager: Version pip 22.3.1
Browser (if applicable): NA
Browser Version (if applicable): NA

Additional Context

We experience this issue while testing our openmined-based PSI solution

Answer 1 · 2023-01-05T19:26:21.000Z

This is a bug with the GCS data structure and having an extremely low fpr. The data structure enum (GCS, BloomFilter) was added much later and was never exposed as a configurable parameter in Python's binding for CreateSetupMessage. We will publish a new release to allow you to specify BloomFilter which temporarily solves the underlying problem.

Answer 2 · 2023-01-06T00:11:18.000Z

great thanks

Answer 3 · 2023-01-06T04:14:04.000Z

This has been fixed in https://github.com/OpenMined/PSI/releases/tag/v1.0.1

Answer 4 · 2023-01-06T04:53:58.000Z

@s0l0ist Nick, seem when the fpr is small 1e-11, it still fails. I updated the notebook:
https://github.com/chesterxgchen/psi_mpc_related/blob/main/psi/psi_intersect.ipynb
In cell 10, you will get
ValueError: basic_string::_M_replace_aux

But I did try with larger fpr=1e-9, the code seems to work

Answer 5 · 2023-01-09T01:21:01.000Z

Yep, looks like smaller false positive rate causes the backing GCS datastructure to try and allocate too much memory. This is 'fixed' in the sense that you must use the psi.DataStructure.BLOOM_FILTER enum and pass it into CreateSetupMessage as the last argument (overriding the default GCS option when not specified).

So I admit, the bug still exists in GCS, but the workaround is to use bloom filters.

setup = dup(
        duplicate, s.CreateSetupMessage(fpr, len(client_items), server_items, psi.DataStructure.BLOOM_FILTER), psi.ServerSetup()
)

Answer 6 · 2023-01-09T06:00:19.000Z

Let me try this. I don't mind extra argument parameter.

Answer 7 · 2023-01-09T06:21:26.000Z

This works:

setup = dup(
        duplicate, s.CreateSetupMessage(fpr, len(client_items), server_items, psi.DataStructure.BLOOM_FILTER), psi.ServerSetup())

Answer 8 · 2023-01-09T14:19:15.000Z

Ah yes, I put the param outside of the function. Oops.

Answer 9 · 2023-01-10T01:22:50.000Z

@s0l0ist Nick,
Unfortunately, the work around doesn't work. adding extra filter fixed the exception. But now the resulting result is incorrect in some cases. For example:

https://github.com/chesterxgchen/psi_mpc_related/blob/main/psi/psi_intersect.ipynb

Compare Cell 11 and 14. Cell 11 is with extra argument, Cell 14 is not.

In Cell 11, the resulting intersect is 19 items which is larger than the Server Items of 17. Noticed that few larger numbers is much larger than max number of the item from server.

Cell 14 is without extra argument, the result is correct.

Answer 10 · 2023-01-10T15:15:01.000Z

Looking at the example code, it looks like the client has many more values than the server. This will yield much higher chances of receiving a false positive value in return and isn't the intended way of how to use this library regardless of the backing data structure used.

The protocol is asymmetric in the sense that both client and server sets can be different sizes, but the real use case is for when the client has much fewer items in its set and wants to check against a server with many more items in its set.

If you flip the client and server set sizes, you should get more reasonable results in expectation of the FPR you've set.

Answer 11 · 2023-01-11T16:56:34.000Z

Nick,
This is a different use case. In Federated Learning (FL) case for example, we have two sites participating to the FL study.
They need to find out the intersect of the users (user id match) before they can do the training. Different from the normal social contact use cases for PSI, we need let both sites know the intersection, not just one site.
see site-1 has 1,000 user_ids
site-2 has 10,000,000 user ids
We can treat site-1 as PSI client, site-2 as PSI Server ==> find the intersect, assume the intersection size = 100
at this point, the intersection is only known to site-1. Site-2 doesn't know it.
so we have to reverse the process, make site-1 as PSI server, and site-2 as PSI Client , and take the intersection as the site-1's items ==> find the intersect, which gives site-2 the intersections. In the reverse process, site-2 items = 10,000,000, site-1 items = 100.

All the examples I showed in the notebook ( client and server items) are not arbitrarily chosen, they are actually from test failure cases, which I simplified to put notebook

Answer 12 · 2023-01-14T00:30:24.000Z

I see, yeah this is a bug. I'll investigate.

Answer 13 · 2023-01-14T20:10:34.000Z

There was a bug that produced more erroneous results than normal; however, it is quite possible for false positives to manifest themselves in the form of intersection index values that are out of range of the original client set. Checkout the my comments in the PR to handle those cases.