geocoders/geocoder-tester

Improving one geocoder

Closed this issue · 10 comments

As discussed in komoot/photon#263 I would prefer contributing to this repo instead of forking it. One question for me is how we should improve one geocoder. I mean, there will be failing tests that can be skipped until the feature is implemented or the ranking is improved, but certain examples must not fail. I can easily add a 'skip' marker to the data in the forked version, but how would this work if I do not fork it? Are there plans for external skip or "ok-to-fail" files?

I don't know, I only build geocoders where all tests pass! :p

Of course not :)

What I usually do is run a subset only while I'm developing, and switch to a bigger set (only France in the case of Addok) when I feel I'm good with the subsets.
I keep the failing tests: they may fail because the geocoder is doing badly somewhere, but they can also fail because the data is missing or bad, or because the search itself does not make sense (I try to fix the test cases for those).

About subsets, remember you can either point at a subfolder:

py.test geocoder_tester/world/france/iledefrance/

Or use the markers to filter in or out some tests:

py.test -m iledefrance  # This will do the same as above.
py.test -m 'not poi'  # This will remove all tests marked as 'poi'.
py.test -m 'iledefrance and not poi'  # This will combine both.
# And so on. :)

Or you can filter by name:

py.test -k 'Berlin'  # This will only run the tests that contain 'Berlin' in the query

Now, to track whether I'm doing better or worse, I use the report option.

So for example, I'd first make a run with the "stable" version, which will be my reference version, by running a command like this:

py.test --save-report reports/master-YYYYMMDD.log

Then when I want to test some changes, I run a compare against that report, with something like:

py.test --compare-report reports/master-YYYYMMDD.log

This will highlight the differences between the two runs: which tests were passing before and are now failing, and which tests were failing before and are now passing.

This does not directly answer your question, but that may be a different approach to deal with it.

Another point, about the ranking: when writing test cases, we can define how many results we want to load when looking for the expected result. In many cases (at least for generic requests), I think it's fine to be more tolerant here and not load only one result. So some "failing tests" may only be failing because the test cases are too strict.
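
As an illustration of that tolerance (this is a rough sketch, not geocoder-tester code: the Photon-style /api endpoint, its q/limit parameters and the properties.name field are assumptions made here), a check could accept the expected result anywhere in the top N hits instead of only at the first position:

import requests

def expected_in_top_n(query, expected_name, n=5, api_url="https://photon.komoot.io/api/"):
    """Return True if expected_name appears among the first n results.
    Assumes a Photon-style GeoJSON response with a properties.name field."""
    resp = requests.get(api_url, params={"q": query, "limit": n})
    resp.raise_for_status()
    names = [f.get("properties", {}).get("name", "") for f in resp.json().get("features", [])]
    return any(expected_name.lower() == name.lower() for name in names)

# Tolerant version of a test case: the expected result only has to be
# somewhere in the top 5, not necessarily the very first hit.
assert expected_in_top_n("Brandenburger Tor, Berlin", "Brandenburger Tor")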

Also, you may want to use --loose-compare, so you don't have failing tests because of an uppercase/lowercase difference or a missing accent in the data set you are using.
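
For example, a run that stays on a subset and relaxes the comparison (assuming --loose-compare combines with the marker filters like any other option):

py.test -m iledefrance --loose-compare  # tolerate case and accent differences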

Now, if those elements do not make your workflow okay, there is certainly room to better mark the tests, either by adding some skipif(geocoder=photon) in the data itself, or by finding a way to do that without needing to alter the test cases themselves. This is something I've not considered yet, so a bit of thinking will be needed.

Thoughts? :)

As there are not that many geocoders, maybe we could indeed introduce a skip column and have a comma-separated list of the geocoders that should not be queried for a given test? Or the opposite: mark the tests with photon if they should be included (then one could easily grep the text file before usage).
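
A rough sketch of how such a skip column could be consumed, assuming a hypothetical comma-separated skip column in the csv files (the column name, the file path and the query field are made up for illustration, not the current geocoder-tester format):

import csv

GEOCODER = "photon"  # geocoder currently under test

def load_cases(path):
    """Yield test cases, dropping rows whose hypothetical skip column
    lists the current geocoder."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            skipped = [g.strip() for g in row.get("skip", "").split(",") if g.strip()]
            if GEOCODER not in skipped:
                yield row

for case in load_cases("geocoder_tester/world/germany/test_berlin.csv"):
    print(case.get("query"))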

I think for photon it would be very valuable if we could export only the successful tests to one huge csv. This way we could integrate this into our testing flow easily, without making geocoder-tester too ugly with a new skipif(photon) or similar, and we could fail hard if a single test fails. And from time to time we would update this huge export file from geocoder-tester, e.g. when we need new tests, as we would not manually add tests to the huge csv.

Do you think this makes sense? How hard would this be?

(or, if one huge csv is not easy, just extracting the successful rows from each csv file and piping them into a new one, where I would not need to preserve the directory structure)
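
A minimal sketch of that extraction, assuming each test case is one csv row; the passes callback is a placeholder that would have to run the query against photon and compare the result (it is not part of geocoder-tester), and the output file name is only an example:

import csv
from pathlib import Path

def export_passing(src_dir, dest_file, passes):
    """Collect the rows that `passes` accepts from every csv under src_dir
    into one flat csv, ignoring the original directory structure."""
    rows, fieldnames = [], []
    for csv_path in sorted(Path(src_dir).rglob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            kept = [row for row in reader if passes(row)]
            if kept:
                rows.extend(kept)
                for name in reader.fieldnames or []:
                    if name not in fieldnames:
                        fieldnames.append(name)
    with open(dest_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)

# `passes` is a stub here; a real version would query the geocoder and
# only keep the rows it answers correctly.
export_passing("geocoder_tester/world", "photon_known_good.csv", passes=lambda row: True)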

Hmmh, maybe --geojson can help me with this, but I wasn't able to make it print anything, not even with any value of --tb. The --tb option does not seem to make any difference either and always prints the comparison table.

Have you tried --save-report and --compare-report? This is supposed to be used for exactly this kind of need.

Ah, now I also understand this feature. I previously thought it did not work, as it didn't store anything when every assert was successful.

Thanks a lot!

Hmmh, the only problem with this approach is that I would need to run it against all asserts, including the known failing ones, leading to much slower test execution.

But this could be fixed with an option like --skip-known-failures (which only makes sense when using --compare-report). Of course no new successful asserts would show up, but that is also not necessary if the goal is just to avoid quality regressions in automated checks.
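
Sketched usage, where --skip-known-failures is only the option proposed here (it does not exist yet) and the report file name is just an example:

py.test --compare-report reports/photon-reference.log --skip-known-failures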

Do you think this is feasible? Created #37

@karussell have you tested with --tb no?