opendp/smartnoise-sdk

Re-adding accuracy estimation to queries

FishmanL opened this issue · 1 comment

What would it take to get accuracy estimation back for more complex aggregates like AVG, assuming we're just using Laplace and Gaussian noise?
(it's pretty urgent for us)

The analytic formula for AVG accuracy ended up not being very useful. We now just simulate to get the accuracies. The pre_aggregated option was added to facilitate this; documentation here.

Something to note is that you can get the simulated accuracies "for free" if you simulate on top of the noisy answer, since all subsequent runs count as post-processing. For example, the code below spends a total epsilon of only 1.0 and gives 95% bounds for AVG(age) and AVG(income):

from copy import deepcopy
import pandas as pd
import tqdm
from snsql import Privacy, from_connection

pums = pd.read_csv("../datasets/PUMS.csv")
yaml = "../datasets/PUMS.yaml"
privacy = Privacy(epsilon=1.0, delta=1e-5)
conn = from_connection(pums, metadata=yaml, privacy=privacy)

query = "SELECT AVG(age), AVG(income) FROM PUMS.PUMS"

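# Parse the query, then use the rewriter to obtain the exact-aggregate subquery
# (e.g. SUM and COUNT for each AVG), pointing it at the original FROM clause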
query_ast = conn.parse_query_string(query)
from_clause = deepcopy(query_ast.source)

subquery_ast, _ = conn._rewrite(query)

subquery_ast.source = from_clause
subquery = str(subquery_ast)

print(f"Original query: {query}")
print(f"Subquery: {subquery}")

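# Run the subquery once through the private connection; this is the only time
# the data is touched and the only privacy spend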
inner = conn.execute(subquery)
# inner = conn.reader.execute(subquery)

print(f"Inner result: \n{inner}")

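# Simulate repeatedly on top of the noisy aggregates; with pre_aggregated these
# runs are pure post-processing and spend no additional privacy budget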
age = []
income = []
n_runs = 1000
for _ in tqdm.trange(n_runs):
    noisy = conn._execute_ast(query_ast, pre_aggregated=inner)
    age.append(noisy[1][0])
    income.append(noisy[1][1])

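# Take empirical 95% bounds from the simulated distributions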
alpha = 0.05

age = sorted(age)
income = sorted(income)
age_lower = age[int(n_runs * alpha / 2)]
age_upper = age[int(n_runs * (1 - alpha / 2))]
income_lower = income[int(n_runs * alpha / 2)]
income_upper = income[int(n_runs * (1 - alpha / 2))]

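# Recompute the reported (noisy) AVGs from the subquery's data row, which for
# this query is laid out as count, sum(age), sum(income)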
income_reported = inner[1][2] / inner[1][0]
age_reported = inner[1][1] / inner[1][0]

print(f"Reported age: {age_reported} will be in [{age_lower}, {age_upper}]")
print(f"Reported income: {income_reported} will be in [{income_lower}, {income_upper}]")

This is a more advanced version of the example in the docs: there, the caller needs to know that AVG is composed of SUM and COUNT, whereas here we use the query rewriter to extract that information. That way you could pass in any other formula and still automatically obtain the right subquery (see the sketch below). The commented-out conn.reader.execute line could be uncommented to skip differential privacy on the inner query and obtain exact results from the database, but then you would be paying privacy spend on every simulation run. Since the line above it runs through the differentially private connection, the values obtained are instead noisy, which means you can report them, and you can simulate "as if" they were the real values to get an estimate of the accuracy impact.
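As a rough illustration, the same rewrite path with a different aggregate would look something like the sketch below (reusing conn and deepcopy from above; VAR is chosen only as an example of another decomposed aggregate):

# Hypothetical sketch: same machinery, different aggregate. The caller never
# needs to know how VAR is decomposed; the rewriter produces the subquery.
query2 = "SELECT VAR(income) FROM PUMS.PUMS"
query2_ast = conn.parse_query_string(query2)
from_clause2 = deepcopy(query2_ast.source)

subquery2_ast, _ = conn._rewrite(query2)
subquery2_ast.source = from_clause2

inner2 = conn.execute(str(subquery2_ast))  # one noisy pass over the data
noisy2 = conn._execute_ast(query2_ast, pre_aggregated=inner2)  # free post-processing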

Note that the database only gets queried once in this sample. The 1,000 pre_aggregated calls all happen in memory, without accessing the original data. We pass in the query_ast to make them a little faster, but could just as easily pass in the query string (sketched below). The loop takes about a minute on my machine, and we could probably speed that up substantially by skipping the repeated validation steps, since it's the same query and only needs to be validated once.
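For example, assuming execute() accepts the same pre_aggregated keyword as _execute_ast, the loop body could instead be:

# Assumed equivalent: pass the SQL string rather than the parsed AST each run
# (slightly slower, since the string gets re-parsed and re-validated every time)
noisy = conn.execute(query, pre_aggregated=inner)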

This approach is most useful for AVG and VAR when there is a GROUP BY with large differences in partition size. For example, if reporting statistics by county or ZIP code, some partitions will be gigantic and others will have very few members, so the errors will differ widely per partition. Our preferred way to handle that is to simulate exactly as above, but bin the noisy values by the noisy count of members: for example, collect all noisy AVG and VAR values for partitions with 1-100 members, 100-500, 500-2,500, and 2,500 and above, then report the average +/- error for each bin size (see the sketch below).
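A minimal post-processing sketch of that binning, in pure pandas on the simulation output; the DataFrame shape and column names (noisy_count, noisy_value, reported_value) are hypothetical placeholders for whatever you collect in the simulation loop:

import numpy as np
import pandas as pd

def error_by_partition_size(sim: pd.DataFrame, alpha: float = 0.05) -> pd.Series:
    """Bin simulated noisy AVG/VAR values by the noisy partition count and
    return a +/- error (the 1 - alpha quantile of absolute error) per bin.
    `sim` is assumed to have one row per (run, partition) with columns
    noisy_count, noisy_value, and reported_value (the released value)."""
    bins = [1, 100, 500, 2500, np.inf]
    labels = ["1-100", "100-500", "500-2500", "2500+"]
    size_bin = pd.cut(sim["noisy_count"], bins=bins, labels=labels, right=False)
    abs_err = (sim["noisy_value"] - sim["reported_value"]).abs()
    return abs_err.groupby(size_bin).quantile(1 - alpha)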