privacysandbox/aggregation-service

Clarifications on aggregation service batches + enhanced debugging possibilities

CGossec opened this issue · 8 comments

Hello aggregation service team,

We (Criteo) would like to seek clarification on a couple of points to ensure we have a comprehensive understanding of certain features.
Your insights will greatly assist us in optimizing our utilization of the platform:

  1. Batch Size Limit (30k reports):
    Could you kindly provide more details about the batch size limit of 30,000?
    We are a little unsure how this limit behaves: it is our understanding that the aggregation service expects loads of up to tens (even hundreds) of thousands of reports. However, when we provide it with batches of 50k+ reports, our aggregations fail.
    Is the 30k limit enforced per Avro file within the batch, or per batch overall?
    If it is per overall batch, do you have any suggestions for aggregating batches of more than 30k reports?
    If we need to split these larger aggregations over several smaller requests, that will greatly increase the noise levels in our final results and work against the idea of the aggregation service, which encourages adtechs to aggregate as many reports as possible to increase privacy.
    Understanding the specifics of this limit should greatly help us in tailoring our processes more effectively.

  2. Debug Information on Privacy Budget Exhaustion:
    We've been considering ways to enhance our debugging capabilities, especially in situations where the privacy budget is exhausted. Would it be possible to obtain more detailed debug information in such cases, specifically regarding the occurrence of duplicates? We believe that having, for instance, the report_ids of the duplicates wouldn't compromise privacy, and it would significantly contribute to our troubleshooting efforts.

On (1), I asked about a related point, and the answer may shed light on your question. I guess you need to split your batches by destination site:

Q: 1a. This doc recommends batching by “reports generated for a given advertiser on a given date”, but how can we know the advertiser? I don't think an advertiser id can be supplied to the .well-known API.

A: 1a) The recommendation is to batch per advertiser, with the understanding that each advertiser will have their own set of bucket/key structures for aggregation. For ARA, when you perform the trigger registration, the destination site will be available in the shared_info field.
1b) We do recommend batching by advertiser to avoid hitting any limits on the privacy budget per batch. Each report will have a shared_id, and each unique shared_id will consume 1 privacy budget. The Shared ID is a combination of API version, reporting origin, destination site, source registration time and scheduled report time, obtained from the report's shared_info field.
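To make that combination concrete, here is a minimal Python sketch (a hypothetical helper, not the service's actual implementation) of the grouping key behind a Shared ID; the truncation granularities follow the aggregation service team's later reply in this thread:

```python
import json

def shared_id_key(shared_info_json: str) -> tuple:
    """Illustrative grouping key behind a report's Shared ID: API version,
    reporting origin, destination site, source registration time (truncated
    to the day) and scheduled report time (truncated to the hour), all read
    from the report's shared_info field."""
    info = json.loads(shared_info_json)
    return (
        info["version"],
        info["reporting_origin"],
        info["attribution_destination"],
        int(info["source_registration_time"]) // 86400,  # day buckets
        int(info["scheduled_report_time"]) // 3600,      # hour buckets
    )
```

All reports that produce the same tuple would draw on the same privacy budget.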

Hi @CGossec ,

Please find the below response:

  1. Batch Size Limit

    We do not have a limit on the number of reports per batch. You will, however, need to ensure that you have a right-sized Aggregation Service instance according to the sizing guide.

    Are you receiving the following error: PRIVACY_BUDGET_ERROR? We do recommend batching per advertiser to avoid hitting any limits on the privacy budget per batch on Aggregation Service. Each Shared ID will have its own privacy budget. Each report will have a Shared ID based on the combined data points of API version, reporting origin, destination site, source registration time and scheduled report time from the report's shared_info field.

  2. Debug Information on Privacy Budget Exhaustion

    Currently, we are looking into solutions to provide more details on the exhausted budget. We will update the Aggregation Service GitHub page once we have a proposal.

    For the immediate term, we do recommend batching by scheduled_report_time. Since Aggregation Service keeps track of Shared IDs for disjoint batches, using the scheduled_report_time will ensure that the same Shared ID does not appear across jobs; a sketch of this batching follows below.
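As a concrete illustration of that recommendation, here is a minimal sketch (hypothetical helper, assuming each report is the serialized JSON aggregatable report carrying a shared_info string field) that partitions reports into disjoint batches by scheduled_report_time truncated to the hour:

```python
import json
from collections import defaultdict

def batch_by_scheduled_report_time(reports: list[str]) -> dict[int, list[str]]:
    """Partition serialized aggregatable reports into disjoint batches keyed
    by the hour bucket of their scheduled_report_time, so the same Shared ID
    never ends up split across two jobs."""
    batches = defaultdict(list)
    for report in reports:
        # shared_info is itself a JSON string embedded in the report JSON
        shared_info = json.loads(json.loads(report)["shared_info"])
        hour_bucket = int(shared_info["scheduled_report_time"]) // 3600
        batches[hour_bucket].append(report)
    return dict(batches)
```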

Hello,

  1. Batch Size Limit

We are indeed observing PRIVACY_BUDGET_ERROR.

We do recommend to batch per advertiser to avoid hitting any limits on the privacy budget per batch on Aggregation Service.

Could you please provide more details on which type of limits you are referring to? As mentioned in our initial comment, on our side we observe a limit of 30,000 reports per batch, which seems to be enforced by this line of code: https://github.com/privacysandbox/control-plane-shared-libraries/blob/main/java/com/google/scp/operator/cpio/privacybudgetclient/HttpPrivacyBudgetClient.java#L72

Additionally, the referenced sizing guide seems to indicate that even the smallest possible instance, m5.2xlarge, should support up to 10M reports and 70M domain keys, which is far above the 30k limit we observe. According to your answer, we shouldn't be encountering these issues. Is something perhaps off with our usage of the aggregation service, or is this expected behavior?

  2. Debug Information on Privacy Budget Exhaustion

As per your recommendations, we are currently batching by scheduled_report_time. This method covers a large share of our reports but misses some. We therefore have a more sophisticated deduplication mechanism in mind which, while still relying on scheduled_report_time, would allow us to aggregate approximately 2% additional reports that are currently being missed. In our experiments, using the Shared ID mentioned in your documentation, we find no duplicates in the submitted batches, yet the aggregation service still fails when running aggregation jobs: having extra debug information would prove very useful.

Hi @CGossec ,

Thanks for your patience.

  1. Batch Size Limit:

    The 30,000 is the current limit on privacy budgets per batch. Each report will have a Shared ID, and each unique Shared ID is equivalent to 1 privacy budget. The Shared ID is derived from the report's shared_info field, combining the API version, reporting origin, destination site, source registration time (truncated to the day) and scheduled report time (truncated to the hour).

    As an example, take the two shared_info fields below. The API is the same (attribution-reporting), the attribution_destination is the same (https://privacy-sandcastle-dev-shop.web.app), the reporting_origin is the same (https://privacy-sandcastle-dev-dsp.web.app), and the source_registration_time is the same (0). Only the scheduled_report_time differs: one is "Tuesday, January 2, 2024 5:19:12 PM" and the other is "Tuesday, January 2, 2024 5:24:22 PM". Truncated to the hour, both become "Tuesday, January 2, 2024 5 PM" (see the arithmetic check after the examples). This means the two reports share one privacy budget, so you can have hundreds, thousands, or more reports that together consume 1 privacy budget. All of the reports with the same Shared ID will have to go in the same batch.

    "shared_info": "{"api":"attribution-reporting","attribution_destination":"https://privacy-sandcastle-dev-shop.web.app\",\"debug_mode\":\"enabled\",\"report_id\":\"af0cfc09-18d3-4234-8d02-1e36a189a7c4\",\"reporting_origin\":\"https://privacy-sandcastle-dev-dsp.web.app\",\"scheduled_report_time\":\"1704215952\",\"source_registration_time\":\"0\",\"version\":\"0.1\"}",

    "shared_info": "{"api":"attribution-reporting","attribution_destination":"https://privacy-sandcastle-dev-shop.web.app\",\"debug_mode\":\"enabled\",\"report_id\":\"1a1b25aa-5e1b-43fc-b80e-9cc9e8ce7658\",\"reporting_origin\":\"https://privacy-sandcastle-dev-dsp.web.app\",\"scheduled_report_time\":\"1704216262\",\"source_registration_time\":\"0\",\"version\":\"0.1\"}", \

  2. Debug Information on Privacy Budget Exhaustion:

    Based on the disjoint batches, would you be able to ensure that there are no overlaps between the batches based on the Shared ID? Please ensure that source_registration_time (truncated to the day) and scheduled_report_time (truncated to the hour) are taken into consideration; a sketch of such a check follows below.

    If the above has been considered and you're still getting an error, we're happy to look into your batch.
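For completeness, here is a small sketch of that overlap check (hypothetical helper names, reusing the same illustrative grouping key sketched earlier in the thread); it flags any Shared ID key that appears in more than one batch:

```python
import json
from collections import defaultdict

def shared_id_key(shared_info: dict) -> tuple:
    # Same illustrative grouping key as sketched earlier in the thread.
    return (
        shared_info["version"],
        shared_info["reporting_origin"],
        shared_info["attribution_destination"],
        int(shared_info["source_registration_time"]) // 86400,  # day
        int(shared_info["scheduled_report_time"]) // 3600,      # hour
    )

def find_overlapping_shared_ids(batches: dict[str, list[str]]) -> dict:
    """Return the Shared ID keys that appear in more than one batch; any
    such overlap would draw on the same privacy budget from two jobs."""
    seen = defaultdict(set)
    for batch_name, reports in batches.items():
        for report in reports:
            info = json.loads(json.loads(report)["shared_info"])
            seen[shared_id_key(info)].add(batch_name)
    return {key: names for key, names in seen.items() if len(names) > 1}
```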

Hi @maybellineboon,

Thanks a lot for your explanations. It seems we had misunderstood the batch size limit as a limit on the number of reports per batch (rather than on the number of reports' Shared IDs), which caused the errors we were encountering in our aggregation process.
This was perfectly clarified by your latest explanation, and we are indeed seeing no more errors in our latest testing after implementing the changes you recommended.
Our tests are progressing smoothly, and while we may come back to this issue in the future if we face other related problems, we believe we don't need any further information on the batch size limitations.

Of course, additional debugging capabilities will always be welcome, and we are keeping our eyes peeled for any progress in that direction :)

Hi @CGossec ,

Thank you for your confirmation. Happy to have helped.

Any updates will be shared on this GitHub page or via an external announcement email from Aggregation Service.

Do let us know should you have further questions.