pelias/acceptance-tests

Fuzzy Testing

orangejulius opened this issue · 6 comments

Testing the overall functionality of a geocoder is not like testing many other software projects.
While a geocoder's interface is as simple as a single function in a programming language (in our
case: send a string as input, and get GeoJSON back), the underlying complexity is far greater.

Worse, there is lots of data involved, and at an acceptance-test level it's impossible to separate
correctness of the data and correctness of the code. Both have to work together to produce a correct
result. With a geocoder using open data we don't even have full control over our data. It can change
at any time without action on our part and changes we suggest to make things better may not stick.

So it's very clear that unit-test style tests will not be completely sufficient for testing a
geocoder. We will need something else as well.

What do we need?

What problems have to be solved to accurately and completely test a geocoder? At the very least, a
solution has to have the following properties:

Scales well to thousands of tests or more

We are literally testing the entire planet, so there are going to be a lot of tests.

Accepts that some tests will always be failing, and still provides value

A build with 10,000 tests that fails if any one test fails will never pass, and simply getting
one bit (pass or fail) of information back from running tests will not tell us enough.

Gracefully degrades when data or code changes

Many teams hate writing tests because they end up spending too much time updating tests to pass when
their newly written or refactored code already works. We have that problem, and the problem of
previously passing tests spontaneously failing because of changing data.

Handles tests of low quality

We aren't going to write 10,000 tests ourselves. The majority of them will hopefully come from
non-technical users not involved in our project at all. Some of these tests will be bad but we won't
have time to improve them all.

What should we build?

The acceptance tests we should keep around

The first part is close to what we already have: a very closely watched set of acceptance tests. We
had about 300 of these tests until recently, when many of them were pruned. I think we should get
this down to about 50.

We should carefully craft and maintain all of these tests. Each one should very specifically test one
particular feature or bit of functionality. If two tests do almost the same thing we should get rid
of one of them. When data changes we should be responsible for manually updating the tests. The idea
is that we have so few that this isn't a huge issue.

Peter's Guide to Writing Solid Tests is certainly the starting point of our high standards for these tests.

Fuzzy testing suite

The new part we need to build is, for lack of a better name, a fuzzy testing suite. This will have
more in common with, say, the test suite for a statistical machine translation system than with a
unit-test suite.

This fuzzy testing suite will have the thousands and thousands of user-generated tests, and at any
given time a bunch of them will be failing. We will almost never update the expectations of these
tests, and as our most strongly cherished rule, we will NEVER change the input string for these
tests. If we think of a new input string that's even slightly different than one we already have,
we'll just add another test.

How to make the tests valuable

After collecting all of these tests, we'll have to do a bit of work to make tooling to even get any
value out of them.

First, instead of the main output of a test suite being simply pass/fail, the output should be given
as a score. The score for the test suite will simply be the sum of the scores from all the tests,
and for each test it will be based on how many of the expectations of the test passed. This gives
our tests a little bit more granularity. The simplest implementation would give one point for each
expectation, so a test that expected certain values for two admin names and one name field could get
three points if all three matched.
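
The one-point-per-expectation idea can be sketched roughly as below. This is only an illustration: the function name, the flat dictionary shape of a test, and the field names and place names are assumptions, not part of any existing Pelias tooling.

```python
def score_test(expected, actual):
    """Return (earned, possible): one point for each expectation that matches.

    `expected` and `actual` are flat dicts of field name -> value
    (a hypothetical test format chosen for this sketch).
    """
    earned = sum(1 for field, value in expected.items()
                 if actual.get(field) == value)
    return earned, len(expected)


# Hypothetical test expecting two admin names and one name field.
expected = {"name": "Empire State Building",
            "locality": "New York",
            "region": "New York"}
actual = {"name": "Empire State Building",
          "locality": "New York",
          "region": "New Jersey"}

print(score_test(expected, actual))  # (2, 3)
```

Returning the possible points next to the earned points keeps enough information to report either a raw total or a percentage later.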

The test suite score could be reported as a total score or a percentage of the maximum possible
score; I'm not sure which yet. Percentages seem popular for this sort of thing.
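
Both reporting styles fall out of the same data, so a sketch like the following could print them side by side. It assumes each test has already been reduced to an (earned, possible) pair; the function name and that input shape are made up for illustration.

```python
def suite_score(results):
    """Sum per-test scores into a suite total and a percentage.

    `results` is an iterable of (earned, possible) pairs, one per test
    (an assumed format for this sketch).
    """
    results = list(results)
    earned = sum(e for e, _ in results)
    possible = sum(p for _, p in results)
    # Guard against an empty suite rather than dividing by zero.
    percent = 100.0 * earned / possible if possible else 0.0
    return earned, possible, percent


earned, possible, percent = suite_score([(3, 3), (2, 3), (0, 2)])
print(f"{earned}/{possible} points ({percent:.1f}%)")  # 5/8 points (62.5%)
```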

The other thing we'll want is some sort of tool set for easily viewing differences in test failures
between two builds. This would be used to see how certain code changes affect our success rate in
the tests, and occasionally could be used to update tests if we feel the need.

It sounds like this should be a diff-like utility, and if we can have our test suite output simply
be readable by diff itself, or a thin layer over diff, that saves us time building stuff and
learning a new tool.
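
One way to get diff-friendly output, sketched under the assumption that results are available as a mapping from test id to (earned, possible): write one line per test in a stable sorted order, so that any line-based diff shows exactly the tests whose score changed. The test ids and the tab-separated line format are hypothetical; Python's standard `difflib` stands in for the `diff` command here.

```python
import difflib


def report_lines(results):
    """One line per test, sorted by id so the order is stable between builds."""
    return [f"{test_id}\t{earned}/{possible}"
            for test_id, (earned, possible) in sorted(results.items())]


def diff_builds(old, new):
    """Unified diff (no context lines) between two builds' reports."""
    return list(difflib.unified_diff(
        report_lines(old), report_lines(new),
        fromfile="build-old", tofile="build-new", lineterm="", n=0))


# Hypothetical results for two builds.
old_build = {"london-01": (3, 3), "london-02": (1, 3), "usa-01": (2, 2)}
new_build = {"london-01": (3, 3), "london-02": (3, 3), "usa-01": (2, 2)}

for line in diff_builds(old_build, new_build):
    print(line)
```

Writing the same lines to a file per build would let plain `diff` do the comparison with no extra tooling.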

Unsolved problems

There are a few other questions and problems we have that this plan doesn't directly solve. I'm open
to suggestions, please feel free to offer any thoughts!

Builds break sometimes

Sometimes something is wrong with our infrastructure and a bunch of tests are failing for a reason
unrelated to the data or code. Ideally we'd have a way to filter out these situations so they don't
cause us too much wasted effort figuring out why all our tests are broken.

Reporting pass/fail to travis, etc

While our test suite will not work in terms of a simple pass/fail, lots of tools we might want to
use like bash scripts and TravisCI still do. How do we map our test output to a pass/fail when we
need to? One idea I had would be to count any test suite as a pass as long as none of the tests that
passed in the last build fail (no new regressions), but this requires that our testing tools keep
track of previous runs, which would be a bit of work.
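
The "no new regressions" rule could look roughly like this, assuming the previous run's results have been stored somewhere and loaded as a mapping from test id to a passed flag. The function names are hypothetical, and treating a test that disappeared from the current run as a regression is a choice made for this sketch.

```python
def new_regressions(previous, current):
    """Ids of tests that passed in the previous run but do not pass now.

    `previous` and `current` map test id -> True (passed) / False (failed).
    A test missing from `current` counts as a regression (an assumption).
    """
    return sorted(test_id for test_id, passed in previous.items()
                  if passed and not current.get(test_id, False))


def exit_code(previous, current):
    """0 if nothing regressed, 1 otherwise, for tools that need pass/fail."""
    return 1 if new_regressions(previous, current) else 0


previous = {"a": True, "b": False, "c": True}
current = {"a": True, "b": False, "c": False, "d": False}

print(new_regressions(previous, current))  # ['c']
print(exit_code(previous, current))        # 1
# A CI wrapper would finish with sys.exit(exit_code(previous, current)).
```

Tests that were already failing ("b") and brand-new failing tests ("d") do not break the build under this rule; only "c" does.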

Insight from tests changing over time

@riordan made a great, brief comment to me today about how older feedback tests and test results
could be used as snapshots of the progress of our geocoder. We could compare test results for the
same tests over time pretty easily.

Another related question is what to do as our test suite ages: a 2-year-old test to find a building
that was demolished last year probably isn't useful. Maybe we assign more weight to newer tests, or
something similar.
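
As one possible reading of "more weight to newer tests", the sketch below halves a test's weight for every year of age. The exponential decay, the one-year half-life, and all names are assumptions picked for illustration; nothing here has been decided.

```python
from datetime import date


def age_weight(created, today, half_life_days=365.0):
    """Weight that halves every `half_life_days` (an assumed decay rule)."""
    age_days = max((today - created).days, 0)
    return 0.5 ** (age_days / half_life_days)


def weighted_percent(results, today):
    """Age-weighted suite percentage.

    `results` is a list of (earned, possible, created_date) tuples
    (a hypothetical format for this sketch).
    """
    earned = sum(e * age_weight(c, today) for e, _, c in results)
    possible = sum(p * age_weight(c, today) for _, p, c in results)
    return 100.0 * earned / possible if possible else 0.0


today = date(2015, 7, 6)
results = [
    (3, 3, date(2015, 7, 6)),  # new test, fully passing, weight 1.0
    (0, 3, date(2014, 7, 6)),  # year-old test, failing, weight 0.5
]
print(f"{weighted_percent(results, today):.1f}%")  # 66.7%
```

Unweighted, the same two tests would score 50%; the older failing test counts for less here.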

awesome write-up! +Infinity

👍

Excellent framing. Now onto talking specifics.

What if our acceptance-test framework is actually a simple server with an endpoint to run tests and the ability to store results on the server (mongodb anyone? 😄). Index page provides a dashboard view of recently executed tests. A progress bar for any in-progress runs would be lovely. Having the history stored locally will allow us to do various reporting over time.

Something like this would be sufficient at first:

pelias.testrocity.com/run?label=my_prod_test&endpoint=pelias.mapzen.com&regions=usa,london

We can still run it locally during development and while adding test cases. Once PRs are merged into the production branch it will auto-deploy and restart the service.

Running this on our servers will allow us to provide it with an uber API key with very generous limits so tests don't take forever. And we can easily hit it from circle or travis when needed.
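
For hitting it from a CI script, building the run URL might look something like this. The host and query parameters are taken from the example URL above; the `http://` scheme and the helper name are assumptions, and no such service exists yet.

```python
from urllib.parse import urlencode


def run_url(label, endpoint, regions,
            base="http://pelias.testrocity.com/run"):
    """Build the URL that would trigger a test run (scheme is assumed)."""
    query = urlencode(
        {"label": label, "endpoint": endpoint, "regions": ",".join(regions)},
        safe=",",  # keep the comma-separated region list readable
    )
    return f"{base}?{query}"


print(run_url("my_prod_test", "pelias.mapzen.com", ["usa", "london"]))
```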

Thoughts?

Yeah, that's a good idea. It seems like a certainty that to get all the
reporting we'll want, there has to be a server component. I was reading
through the Travis docs yesterday to see if they had any little known
features for storing state between test runs (to do something like output
which tests are newly failing in the latest build, etc). Unsurprisingly
they actually go out of their way to talk about all the ways state is NOT
shared between test runs, which is probably what their customers want most
of the time.

We already use mongo for the actual feedback test gathering, so it's a
natural choice.

(Replying by email to Diana Shkolnikov's comment of Mon, Jul 6, 2015 at 7:57 PM, quoted above.)

Thanks,
Julian

CouchDB comes with a RESTful API and, well, how to put this nicely, it isn't mongo. It might also pay to loop in @heffergm when making decisions on services you would like maintained in production.

I think the general idea with the travis artefacts feature is to have a place to retrieve stuff like code coverage reports after the instance has been destroyed; your other option is obviously to send the data out of travis manually via HTTP.
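
Sending the data out manually could be as small as one POST at the end of the build. The sketch below only builds the request so it runs offline; the URL, the JSON payload shape, and the function name are all hypothetical, since no results server exists yet.

```python
import json
import urllib.request


def build_results_request(url, build_id, results):
    """Prepare a JSON POST carrying one build's test results."""
    body = json.dumps({"build": build_id, "results": results}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server address and payload.
req = build_results_request("http://localhost:3000/results", "1234",
                            {"london-01": [3, 3], "london-02": [1, 3]})
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would actually send it from the CI job.
```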