pholser/junit-quickcheck

Controlling size for nested data structures?

mikera opened this issue · 5 comments

Love the library! It is proving very useful. Quick question....

SITUATION: I'm writing some custom generators for data structures that potentially need to generate multiple nested structures of the same type.

PROBLEM: There is a certain threshold where the sizes of the generated data structures seem to explode exponentially, probably because the expected number of children increases at each level (which in turn have a larger expected number of children themselves...)

QUESTION: Is there a good way to control the sizes of nested structures? Ideally if I am generating a data structure of size n with m children, it seems like it would be sensible for each child to be generated with size n/m but it's not quite clear how to do this?

@mikera Thanks for checking out junit-quickcheck. I don't plan to develop on it any further -- you may want to investigate Jqwik by @jqwik-team.

That said, I'd be interested to check out your custom generators. Can you post some code?

Thanks for the reference, I will take a look!

Here's an example of a fairly minimal generator that blows up (stack overflow) while creating a proportion of nested instances of the same type (ArrayLists of ArrayLists and Longs):

import java.util.ArrayList;

import com.pholser.junit.quickcheck.generator.GenerationStatus;
import com.pholser.junit.quickcheck.generator.Generator;
import com.pholser.junit.quickcheck.random.SourceOfRandomness;

/**
 * Generator for arbitrary trees of Longs
 */
public class BadListGen extends Generator<Object> {
	public BadListGen() {
		super(Object.class);
	}

	@Override
	public Object generate(SourceOfRandomness r, GenerationStatus status) {
		int size=status.size()+1;
		int type = r.nextInt(10);
		
		switch (type) {
			case 0: 
				int n = r.nextInt(size);
				ArrayList<Object> al=new ArrayList<>();
				for (int i=0; i<n; i++) {
					al.add(generate(r,status));
				}
				return al;
				
			default: 
				Long l=r.nextLong(-size, size);
				return l;
		}
	}
}

Ideally, I think the size in the nested call to generate needs to be reduced to something like size/n on each iteration? The intention would be that the size controls the overall number of elements in the tree, rather than the size of each child

@mikera Ah yes, I see what you mean. Sadly, I didn't make accessible GenerationStatus derivatives that would allow you to futz with size, etc. Your best bet may be to make a GenerationStatus subclass where you can control/modulate size in the manner you wish, and you can feed to your recursive calls. There is an internal-but-maybe-still-accessible AbstractGenerationStatus class you can use to get started, if you don't want to implement the interface directly.

@mikera What do you think about something like this?

public class BadListGen extends Generator<Object> {
    public BadListGen() {
        super(Object.class);
    }

    @Override
    public Object generate(SourceOfRandomness r, GenerationStatus status) {
        List<Object> root = new ArrayList<>();
        for (int i = 0; i < status.size(); ++i) {
            root.add(_generate(r, status.size(), 1));
        }
        return root;
    }

    private Object _generate(SourceOfRandomness r, int size, int level) {
        int type = r.nextInt(10);

        if (type == 0) {
            int numberOfChildren = size == 0 ? 0 : r.nextInt(size);
            List<Object> children = new ArrayList<>();
            for (int i = 0; i < numberOfChildren; i++) {
                children.add(_generate(r, size / (level + 1), level + 1));
            }
            return children;
        }

        return r.nextLong(-size, size);
    }
}

Thanks @pholser for taking a look - your suggestion works well in simple cases

Only slight challenge is if the generator is parameterised and needs to generate something (e.g. leaf nodes) with a generic Generator that doesn't have a _generate and therefore needs a GenerationStatus. Perhaps I can play around with some clever subclassing to solve this :-)

Perhaps an interesting research topic for someone would be exploring how test data generators can be structured to provide provable O(size) construction time and space.