litestar-org/polyfactory

Enhancement: Add a rejection sampler

Opened this issue · 5 comments

Summary

Currently the batch method fails with a validation error if any of the generated rows fails the schema validators. To make the package usable in a testing environment, it would be useful to be able to generate a dataframe of any size using a rejection sampler, i.e. a method that keeps retrying until enough valid rows have been built. This method should store the random seeds of the successful builds so that the same dataframe can be reproduced each time.

I have created a class that does this, included below. Since I needed it for my own project, it could be a useful feature for others who want to use Polyfactory for testing. I built it on top of the original pydantic factories package, but I imagine it would be much the same for the additional Factory options in Polyfactory.

Basic Example

import time
import json
import pandas as pd
from polyfactory.factories.pydantic_factory import ModelFactory
from pydantic import ValidationError

class RejectionSampler:
    """Create a synthetic dataset from a pydantic schema, skipping builds
    that fail the validation defined in the schema.

    Parameters
    ----------
    factory (ModelFactory): ModelFactory created from the pydantic schema
    size (int): Number of rows to create
    """

    def __init__(self, factory: ModelFactory, size: int) -> None:

        self.factory = factory
        self.size = size
        self.used_seeds = []

    def setup_seeds(self):
        """Search for seeds whose builds pass validation and store them."""
        start = time.time()

        # Start at seed 1 and move to the next seed after every attempt,
        # pass or fail, so the search itself is reproducible.
        seed_no = 1

        for _ in range(self.size):
            built = False
            while not built:
                try:
                    self.factory.seed_random(seed_no)
                    self.factory.build()
                    self.used_seeds.append(seed_no)
                    built = True
                except ValidationError:
                    # Invalid build: reject this seed and try the next one.
                    pass
                seed_no += 1

        end = time.time()

        print(f"finished, took {seed_no - 1} attempts to generate {self.size} rows")
        print(f"took {end - start} seconds to set up seeds")

    def generate(self):
        """Rebuild the dataset from the stored seeds."""
        start = time.time()

        rows = []
        for seed in self.used_seeds:
            self.factory.seed_random(seed)
            result = self.factory.build()
            # .json() is the pydantic v1 API; pydantic v2 prefers .model_dump_json()
            rows.append(json.loads(result.json()))

        # DataFrame.append was removed in pandas 2.0, so build the frame once.
        synthetic_data = pd.DataFrame(rows)

        end = time.time()

        print(f"took {end - start} seconds to generate new data")

        return synthetic_data

Drawbacks and Impact

No response

Unresolved questions

No response


Hiya, sure - but this requires a dependency on pandas or Polars, no?

True, it could instead just return a list of JSON objects, which removes the pandas dependency but keeps the reproducible valid batch component?
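
As a rough illustration of that pandas-free variant, here is a minimal sketch using only the standard library. The `build_row` and `validate` functions are hypothetical stand-ins for `factory.build()` and the schema validators (they are not part of polyfactory), and the `max_attempts` guard is an added assumption to avoid looping forever when valid rows are rare.

```python
import random
from typing import Any

Row = dict[str, Any]


def build_row(rng: random.Random) -> Row:
    """Hypothetical stand-in for factory.build(); may return invalid rows."""
    return {"age": rng.randint(-20, 120), "score": round(rng.random(), 3)}


def validate(row: Row) -> None:
    """Hypothetical stand-in for the schema validators."""
    if not 0 <= row["age"] <= 100:
        raise ValueError(f"age out of range: {row['age']}")


def find_valid_seeds(size: int, max_attempts: int = 10_000) -> list[int]:
    """Rejection sampling: try seeds 1, 2, 3, ... and keep the ones that validate."""
    seeds: list[int] = []
    for seed in range(1, max_attempts + 1):
        if len(seeds) == size:
            break
        try:
            validate(build_row(random.Random(seed)))
        except ValueError:
            continue  # rejected: move on to the next seed
        seeds.append(seed)
    if len(seeds) < size:
        raise RuntimeError(f"only {len(seeds)} valid rows in {max_attempts} attempts")
    return seeds


def generate(seeds: list[int]) -> list[Row]:
    """Rebuild the same rows from the stored seeds; no validation needed."""
    return [build_row(random.Random(seed)) for seed in seeds]


seeds = find_valid_seeds(5)
rows = generate(seeds)  # a list of plain, JSON-serialisable dicts
```

Storing only the seeds keeps the saved state small, and the caller can still pass the returned list to pandas or Polars if a dataframe is wanted.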

Yes, this should not require a dependency on any third-party library.

How about adding the possibility of installing it as an extension?
I mean something like
pip install polyfactory[pandas] or pip install polyfactory[extras]

@williamjamir why do you want to have pandas for this? Also, I'm not sure about the need for this feature. polyfactory should not be creating instances that fail the validation of any of the libraries it supports. If it does, then that's a bug which should be fixed.

EDIT: Actually, this could be useful where you have your own custom validators which polyfactory cannot support. Though if this is the case, I think the better option would be to use the Use field or implement a classmethod for those fields, so that the generated values pass your custom validators as well.