privacysandbox/aggregation-service

Aggregation service, ARA browser retries and duplicate reports

Closed this issue · 2 comments

The way the browser and the adtech's servers interact over the network makes it unavoidable that some reports are received by the adtech but not acknowledged as received by the browser (e.g. when a timeout happens); these reports are then retried and received several times by the adtech, as mentioned in your documentation:

The browser is free to utilize techniques like retries to minimize data loss.

Sometimes these duplicates amount to hundreds of reports each day, for several days (sometimes several months) in a row, all carrying the same report_id.
The aggregation service enforces its no-duplicates rule based on a combination of data points:

Instead, each aggregatable report will be assigned a shared ID. This ID is generated from the combined data points: API version, reporting origin, destination site, source registration time and scheduled report time. These data points come from the report's shared_info field.
The aggregation service will enforce that all aggregatable reports with the same ID must be included in the same batch. Conversely, if more than one batch is submitted with the same ID, only one batch will be accepted for aggregation and the others will be rejected.
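
For illustration, here is a minimal sketch (in Python) of the kind of shared-ID-style key we can derive on our side from a report's shared_info. The field names are assumptions based on how shared_info typically looks in aggregatable reports, and the actual hash computed by the Aggregation Service is internal, so this is only an approximation of the grouping described above:

```python
import json

def shared_id_key(shared_info_json: str) -> tuple:
    """Build a dedup key from the data points the shared ID is said to be
    derived from: API version, reporting origin, destination site,
    source registration time and scheduled report time.

    Field names are assumed from typical shared_info payloads; verify them
    against your own reports. The Aggregation Service's real hash is
    internal and may differ from this tuple.
    """
    info = json.loads(shared_info_json)
    return (
        info.get("version"),                   # API version
        info.get("reporting_origin"),          # reporting origin
        info.get("attribution_destination"),   # destination site
        info.get("source_registration_time"),  # source registration time
        info.get("scheduled_report_time"),     # scheduled report time
    )
```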

As an adtech company trying to provide timely reporting to clients, it is paramount to use all of the available information (in this case, reports) so that our reporting is as precise as possible.
In this scenario, however, if we batch together all of our reports for a chosen client on a chosen day, even after deduplicating that day's reports by report_id (or by the whole shared_info field), we may have a batch accepted on day 1 and then see all subsequent batches for the next month rejected, because they all contain that same shared_info-based ID.
This means that we have to look further back in the data for possible duplicate reports. To implement this check efficiently, we would benefit from a more precise description of the retry policy, namely how long the retries can go on.

I guess the questions this issue raises are as follows:

  • In what scenarios does a browser perform the aforementioned retries?
  • Is there a time limit for those retries (i.e. a date after the original report when the browser no longer retries sending a report)?
  • If there is not, could you please advise on a way for adtech companies to efficiently filter out duplicate reports without having to process all of their available reports for duplicate shared info values?
  • Also, the problem of duplicated retried reports described here (among other issues) makes us believe that adtechs would benefit from a change to the way the AS handles duplicates. If the AS gracefully dropped the duplicates from the aggregation instead of failing the batch altogether, we would not need to filter such reports out of a batch ourselves. Could this possibility be considered on your side?

Hi @CGossec,

From the Attribution Reporting API Handbook, we can confirm that reports are sent when the browser is online. Should a report fail to send the first time, it is retried after 5 minutes; after a second failure, it is retried again after 15 minutes. Should that attempt also fail, the report is not sent again.

With that in mind, we do recommend waiting a bit for late-arriving reports so that you can collect all reports for batching. Lateness can be measured by comparing the scheduled_report_time with the time the report was actually received; this can help you estimate roughly how long you may want to wait for late-arriving reports.
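
A minimal sketch of how such lateness could be measured, assuming scheduled_report_time is stored in shared_info as a Unix timestamp in seconds (verify against your own reports) and that you record a receive time when the report arrives at your collection endpoint:

```python
from datetime import datetime, timezone

def report_delay_seconds(shared_info: dict, received_at: datetime) -> float:
    """Return how many seconds after its scheduled_report_time a report
    actually arrived. Negative values mean it arrived early."""
    scheduled = datetime.fromtimestamp(
        int(shared_info["scheduled_report_time"]), tz=timezone.utc
    )
    return (received_at - scheduled).total_seconds()
```

Tracking the distribution of these delays (for example, the 99th percentile) over a few weeks gives a data-driven waiting window before closing a batch.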

For duplicate reports (reports with the same shared ID): at the moment, the Aggregation Service does not maintain a record of individual processed reports, so it is recommended to filter out duplicate reports before sending them to the Aggregation Service. You can use the report's scheduled_report_time (rounded to the hour) to batch them.
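
A minimal sketch of this pre-filtering, assuming each collected report is available as a dict with a parsed shared_info (the accessors are assumptions to adapt to your own storage schema):

```python
from collections import defaultdict

def build_hourly_batches(reports):
    """Drop duplicate deliveries (same report_id) and group the remaining
    reports into batches keyed by scheduled_report_time rounded down to
    the hour."""
    seen_report_ids = set()
    batches = defaultdict(list)
    for report in reports:
        info = report["shared_info"]
        report_id = info["report_id"]
        if report_id in seen_report_ids:
            continue  # duplicate delivery of an already-collected report
        seen_report_ids.add(report_id)
        hour_bucket = int(info["scheduled_report_time"]) // 3600 * 3600
        batches[hour_bucket].append(report)
    return batches
```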

The Aggregation Service does keep track of shared IDs, where each shared ID is a hash of several fields (API version, reporting origin, destination site, source registration time, scheduled report time). You can also keep track of what has been batched using the fields that make up the shared ID. That way, should a duplicate report arrive late after the reports for the same field values have already been batched, you can skip the report, as it would otherwise fail due to the privacy budget.
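
A minimal sketch of that bookkeeping, assuming you persist the field tuples of every successfully aggregated batch (for example in a database) and load them into a set before building the next batch; the field names are assumptions to check against your own shared_info payloads:

```python
def filter_already_batched(reports, batched_keys):
    """Skip reports whose shared-ID fields were already covered by an
    accepted batch; such reports would be rejected because their privacy
    budget has already been consumed."""
    fresh = []
    for report in reports:
        info = report["shared_info"]
        key = (
            info.get("version"),
            info.get("reporting_origin"),
            info.get("attribution_destination"),
            info.get("source_registration_time"),
            info.get("scheduled_report_time"),
        )
        if key not in batched_keys:
            fresh.append(report)
    return fresh
```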

Hi @CGossec ,

I'll proceed to close this issue, but do let us know should you still have more questions.

Thanks!