Definitions

  • SUT Subject Under Test. The thing being tested. In OO unit tests, it will usually be an object.
  • TDD Test Driven Development. A test-first practice to get the most useful design information out of your tests.
  • BDD Behavior Driven Development. A style of TDD specifically working from the outside of your application inwards.
  • Regression A regression test is a test that documents a bug, and verifies that it does not return. They are basically created as follows: I discover a bug. I write a test (at the lowest level possible) which fails because of the bug. I then fix the bug, and verify that the test passes. I can now keep this test around to A: document my assumptions about the bug (because if it crops up again, but the test is still passing, then we know the bug was incorrectly diagnosed) and B: prevent regressions. In long-running codebases some pernicious bugs will have a tendency to re-appear (often due to a misunderstanding of the domain by the development team), and good regression tests help prevent their re-appearance.
  • Test Double Any code created to stand in for another item in a test suite. Kinds of test doubles are:
    • Stubs These simply return a fixed response
    • Fakes Test fakes have more logic than a stub, for instances in which the double needs to do something more than just return a fixed response. These should be used only in moderation.
    • Mocks A mock is a test double that assertions can be made about.
    • Spy A specific way of making a mock. (Only listed for completeness. Not going to be covered here).
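
As a rough illustration of the three main kinds of double, here is a minimal Python sketch. Every name in it (`StubExchangeRates`, `FakeUserStore`, `convert`, `register`) is hypothetical and invented for this example; only `unittest.mock.Mock` comes from the standard library. The tests are plain functions with `assert`, which a runner such as pytest would collect.

```python
from unittest.mock import Mock


class StubExchangeRates:
    """Stub: always returns the same fixed response."""

    def rate(self, currency):
        return 2.0


class FakeUserStore:
    """Fake: a working in-memory stand-in with a little real logic."""

    def __init__(self):
        self._users = {}

    def save(self, name, email):
        self._users[name] = email

    def find(self, name):
        return self._users.get(name)


def convert(amount, currency, rates):
    """Hypothetical code under test that depends on a rates collaborator."""
    return amount * rates.rate(currency)


def register(name, email, store, mailer):
    """Hypothetical code under test: saves a user, then sends a welcome mail."""
    store.save(name, email)
    mailer.send_welcome(email)


def test_convert_with_stub():
    assert convert(10, "EUR", StubExchangeRates()) == 20.0


def test_register_with_fake_and_mock():
    store = FakeUserStore()
    mailer = Mock()  # Mock: records calls so assertions can be made about them
    register("ada", "ada@example.com", store, mailer)
    assert store.find("ada") == "ada@example.com"
    mailer.send_welcome.assert_called_once_with("ada@example.com")
```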

Test Suite Layers

  • Unit Unit tests isolate the smallest piece of behavior possible. In OO this will usually be an object. “Unit” is subjective, and will be defined differently in different domains.
  • End-To-End/Feature/Acceptance/UI This is the highest level of testing. “End-To-End” means that you are only testing what is at the outer layers of the application (usually the UI). “Feature tests” means the SUT is a chunk of behavior, a feature, rather than a chunk of code. An “acceptance test” is a feature test written to be an automated verification of the acceptance criteria of a user story.
  • Integration/Service Integration tests are everything that isn’t End-To-End, but that involves multiple pieces working together. What that means depends on how “unit” is defined. You’ll often also see these called “service” tests.

Ratio

The most cost-effective test suites will tend to adhere to the testing pyramid.

[Figure: the testing pyramid (./test_pyramid.gif)]

The classic testing pyramid says that the bulk of the tests should be unit tests. A far smaller number of tests should be integration tests. And an even smaller number than that should be end-to-end tests.

The main goal of the testing pyramid is cost effectiveness. It is a necessary reminder because its opposite, the inverted test pyramid, will often be the outcome of undisciplined teams.

The inverted pyramid is a common mistake because end-to-end tests often appear to provide the highest value. They adhere to the user experience, they are the easiest for non-programmer staff to write and understand, and cover the most functionality.

But the inverted pyramid will result in a test suite that provides a very low value for the cost of its upkeep.

Desirable Qualities

A good test suite is:

Focused

The logic to be tested is contained in as small and isolated a SUT as possible. Ideally a change/break in logic should only break tests in a couple of places. Once at the Unit level, or possibly Integration, and optionally once at the end-to-end level. A focused test suite is the result of applying the single responsibility principle to the test, pushing the test to the lowest layer possible (as per the testing pyramid), and knowing what not to test.

A focused test suite will emphasize testing SUTs that are:

  • heavy on logic/math
  • less prone to frequent changes
  • valuable (at the core of the app, used frequently, and critical to the business)

Fast

The faster a test suite is, the more often it will be run. The more often a test suite is run, the shorter the gap will be between creating a bug, and discovering the bug. The shorter this gap is, the easier the bug will be to diagnose and fix.

Fast test suites enable smart testing practices like TDD.

A fast test suite will encourage writing fast tests, which encourages a proper testing pyramid because well-written unit tests are often the fastest.

Readable

A well-written test suite is like self-verifying documentation. The tests will not just verify behavior, but will demonstrate how the SUT is to be used.

Concrete

Abstractions in tests are less desirable than abstractions in application code. It is better for a test to be verbose, than to hide what is going on. (See obscure test)

Assert literal values whenever possible (See ugly mirror). Assertions should not compare values computed by the application, to values computed by the tests. Ideally assertions should be comparing computed values to hard-coded values. Often if it seems like the expected value must be computed, a golden master can be used instead.
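
A small sketch of the difference, using a hypothetical `total_with_tax` function and `TAX_RATE` constant invented for this example. The first test repeats the application's formula (the ugly mirror); the second asserts a hard-coded value worked out by hand.

```python
TAX_RATE = 0.25


def total_with_tax(subtotal):
    """Hypothetical application code."""
    return subtotal + subtotal * TAX_RATE


def test_total_mirrors_the_implementation():
    # Discouraged: the expected value is computed with the same formula,
    # so this test can only agree with the code, right or wrong.
    subtotal = 80
    assert total_with_tax(subtotal) == subtotal + subtotal * TAX_RATE


def test_total_uses_a_literal():
    # Preferred: 80 plus 25% tax is 100, worked out by hand.
    assert total_with_tax(80) == 100
```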

Flexible

A good test suite will withstand refactorings in application code without needing to be changed. It is important that, as much as possible, failures in the test suite are useful failures.

This is achieved by following best practices to decouple the test code from the application code. This is easier said than done, and largely depends on how well the application code is designed. Beyond that, learning what not to test is important for this.

Undesirable Qualities

For comparison’s sake, a painful/expensive test suite is:

Scattershot

The tests are wide in scope. Breaking some piece of application logic will sometimes result in no additional test failures, but at other times will result in several tests breaking all at once.

Scattershot tests give no indication to future developers what is to be tested or how, and will lead to more scattershot tests.

The scattershot test suite will emphasize SUTs that are:

  • mostly glue code
  • constantly changing
  • at the edges of the system, and not critical
  • A good example of these is a test suite that has more view/GUI tests than anything else

Slow

Slow test suites will rarely be run by developers. Instead of spending time running them locally, developers will often just rely on the CI system to catch failures. Sometimes days will pass between a bug being created and being flagged by the test suite. This delay means that the developer might not remember the context the bug was created in, or might have to make significant changes to the code because the bug has affected architectural decisions.

Slow test suites make TDD impossible.

Slow test suites tend to fall into disuse and thereby are often the first step towards a legacy test suite that no developer wants to touch, requires dedicated staff to maintain, and provides relatively little value.

Obscure

A badly written test suite can take longer to understand than just reading the SUT itself. It will verify logic, but will give no indication of why that logic is important, or how the SUT is to be used.

Abstract

Bad test suites will use abstraction to achieve concision and avoid textual repetition.

The tests will compute values to compare to values computed by the application. Errors like returning a bad value won’t be caught because the logic in the test returns the same bad value.
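
To make that failure mode concrete, here is a sketch with a deliberately buggy, hypothetical `average` function (floor division where true division was intended). The mirrored check repeats the bug and passes anyway; the literal check is the one that notices.

```python
def average(values):
    """Hypothetical application code with a deliberate bug:
    floor division (//) is used where true division (/) was intended."""
    return sum(values) // len(values)


def mirrored_check_passes():
    # The test recomputes the expectation with the same (wrong) logic,
    # so the bad value appears on both sides and the comparison succeeds.
    values = [1, 2]
    return average(values) == sum(values) // len(values)


def literal_check_passes():
    # The average of 1 and 2 is 1.5; a hard-coded expectation exposes the bug.
    return average([1, 2]) == 1.5
```

With the bug in place, `mirrored_check_passes()` returns `True` while `literal_check_passes()` returns `False`.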

Brittle

Bad tests are so tightly coupled to the application code that they will need to be updated every time there’s an update to the application code, regardless of the importance of the change.

Brittle test suites eventually become abandoned test suites, because they will require lots of work to update, but the failures will not be instructive so people will stop paying attention to them.

Unit Tests

“Unit testing” has become such a vaunted idea that the usage of the term has become more widespread than the understanding of the ideas behind it.

There is no fixed definition of “unit,” across all projects. A unit is a chunk of logic/data which makes sense to test atomically, and this will vary in different applications. In object-oriented contexts, it almost always refers to an object. In functional contexts it usually refers to a function. The important thing is that the team on a project has a shared understanding of what the unit is.

I’m assuming an OO context for these notes.

What to test

A good unit test will treat the object it is testing as a black box. It will only test the object’s interface (as defined by the object’s public methods).

How to test these methods depends on whether they are query or command methods.

Query Methods

A query method only computes and returns values. It has no stateful effects on the rest of the system (like writing to a database). But it can call query methods on other objects in the process. A basic test of a query method consists of sending the object a message, and verifying the correctness of the response.

If the query method in turn sends a message to another object, that outgoing message does not get tested here. That message is part of the other object’s interface, and should be properly tested in that separate context.
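
A minimal sketch of a query-method test, assuming a hypothetical `Cart` object invented for illustration. Each test sends one message and checks the response; nothing inside the object is inspected.

```python
class Cart:
    """Hypothetical SUT with a single query method."""

    def __init__(self, prices):
        self._prices = list(prices)

    def total(self):
        # Query: computes and returns a value, changes nothing.
        return sum(self._prices)


def test_total_returns_sum_of_prices():
    cart = Cart([5, 10, 20])
    assert cart.total() == 35


def test_total_of_empty_cart_is_zero():
    assert Cart([]).total() == 0
```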

Command Methods

A command method is one which will send messages to other parts of the system, which effect stateful change. An example is a method which changes the state of another object, and saves its value to a database.

Unlike query methods, the return values of command methods should not be tested. Instead their outgoing messages should be tested, usually with mocks.
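
A sketch of what that can look like, using `unittest.mock.Mock` from the Python standard library. The `Account` class and its `ledger` collaborator (with a `record` method) are hypothetical names made up for this example.

```python
from unittest.mock import Mock


class Account:
    """Hypothetical SUT. `ledger` is a collaborator that persists changes."""

    def __init__(self, ledger):
        self._ledger = ledger

    def deposit(self, amount):
        # Command: the effect is the outgoing message to the ledger.
        self._ledger.record("deposit", amount)


def test_deposit_records_entry_in_ledger():
    ledger = Mock()
    Account(ledger).deposit(50)
    # Assert on the outgoing message, not on a return value.
    ledger.record.assert_called_once_with("deposit", 50)
```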

Isolation

Unit tests are supposed to test objects in isolation. But our objects are dependent on the interfaces of other objects, the filesystem, our database, etc. So we use test doubles to enforce the boundaries of what we want to test.

These techniques should not be used dogmatically. There are benefits and costs to using test doubles, and the costs will be higher in systems that aren’t well-factored. But while they shouldn’t be used in all cases, a real suite of unit tests cannot be created without test doubles (in an object-oriented system).

Query Methods

Because we aren’t testing the outgoing messages from our query methods, we use stubs to enforce the boundaries of our tests.
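
For instance, assuming a hypothetical `ShippingQuote` object that asks a collaborator for a rate, a hand-written stub keeps the test inside the boundary of the SUT. All names here are invented for illustration.

```python
class StubRateSource:
    """Stub: stands in for a slow or external collaborator."""

    def rate_for(self, region):
        return 0.5


class ShippingQuote:
    """Hypothetical SUT whose query method asks a collaborator for data."""

    def __init__(self, rate_source):
        self._rate_source = rate_source

    def cost(self, weight_kg, region):
        return weight_kg * self._rate_source.rate_for(region)


def test_cost_multiplies_weight_by_rate():
    quote = ShippingQuote(StubRateSource())
    # Only the response is checked; nothing is asserted about rate_for.
    assert quote.cost(4, "EU") == 2.0
```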

Command Methods

Since we need to test the outgoing messages from our command method, we use mocks.

End-To-End

An end-to-end test simulates the user experience. For most applications that means that both the test exercise and the expectations take place in the UI. An example would be filling in the username and password form, clicking “submit”, and expecting to see the message “Thanks for signing up!”. It does not reach down below the UI. So in this example, we would not be then checking the database to verify that the user exists.

Good

End-to-end tests are valuable because they can simulate the workflow that a real user would have. They are also the only tests which touch every layer of the system.

End-to-end tests are also the easiest for non-programmer staff to read and understand. This can help get test suite buy-in from non-programmer staff.

End-to-end tests are also the least likely to lie about the functionality of the system, since they (typically) have no test doubles. All of the code exercised in the tests is real code.

They allow for a BDD workflow, if they are kept lightweight and fast enough.

Because of these qualities they are a necessary part of the test suite.

Bad

In most applications the end-to-end tests are necessarily the slowest tests. They are the slowest because they involve every layer of the app, and the tools that run them are typically slower than other testing tools. For web applications most end-to-end tests have to actually open up a browser.

End-to-end tests are also the most brittle tests. They end up being tightly coupled to the UI, since they have to read and write to it, and the UI is the app layer which is subject to the most change. Because they’re so brittle, each end-to-end test added brings a higher long-term maintenance cost than

Ugly

The classic inverted pyramid tends to happen for a reason. The benefits of end-to-end tests are obvious, while their drawbacks are subtle and pernicious.

The fact that they are easy for non-programmer staff to read (and sometimes write, depending on the tools used), often leads them to be over-valued by the business.

It is common for end-to-end tests to be the start of a downward spiral of maintenance costs. Because they are easy to write, and one end-to-end test can touch many parts of the codebase at once, teams that feel they “don’t have time for unit testing” will write end-to-end tests instead, especially because the end-to-end tests can be written by non-programmers.

Where the pernicious costs of end-to-end tests start to come into play is when people start trying to cover corner cases (basically anything except the happy path). A mature application will have several layers interacting with each other, and people will eventually realize that they need to cover all the possible interactions between those layers. Trying to cover all the possibilities of those interactions from end-to-end tests will bloat the number of required tests exponentially. (See Integration Tests Are A Scam in the references).

Over time the end-to-end tests, which originally were seen as saving time, will be so slow and brittle that programmers will want nothing to do with them. Many teams end up needing to hire people solely to maintain their test suite, because it’s become such a maintenance burden. And having a test suite maintained by people who aren’t actually writing the code being tested leads to its own class of issues.

Implementing

How to take advantage of end-to-end tests while avoiding the pitfalls?

Judicious

Because of the long-term maintenance expense of these tests, the value proposition should be high. A good end-to-end test will cover critical features. “Critical” here should be a high bar. Critical as in: each minute this feature isn’t working, our company loses money.

A good example is testing that users can sign up, and pay for an item with a credit card.

A bad example would be testing that the user’s email appears in the proper format on the user settings page. A bug here is annoying but not critical.

Comprehensive

Since each test is expensive, it should be as comprehensive as possible. Unlike unit tests, which aim to cover one and only one thing, a good end-to-end test should cover as much (critical) functionality as possible in as few tests as possible.

So rather than writing 4 separate tests for 1: signing up, 2: logging in, 3: sending money to my Dad, and 4: purchasing a new horse, we should write one test that does all of the above.

Optimistic

End-to-end tests should only cover the happy path. Corner cases should be covered in a lower-level test.

For example: Instead of testing that every possible error case results in a sensible message in the UI, we could instead just test that error messages are displayed at all (by verifying just one of them), and then in a lower-level test we can verify that all the correct user-facing error messages are generated in the correct cases.
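
The lower-level half of that split might look like the following sketch. The `message_for` function, the error codes, and the message texts are all hypothetical, made up for this example; the end-to-end suite would then only need to check that one such message reaches the UI.

```python
ERROR_MESSAGES = {
    "card_declined": "Your card was declined.",
    "card_expired": "Your card has expired.",
}

GENERIC_MESSAGE = "Something went wrong. Please try again."


def message_for(error_code):
    """Hypothetical lower-level function that maps error codes to UI text."""
    return ERROR_MESSAGES.get(error_code, GENERIC_MESSAGE)


def test_each_error_code_has_its_message():
    assert message_for("card_declined") == "Your card was declined."
    assert message_for("card_expired") == "Your card has expired."


def test_unknown_error_code_gets_generic_message():
    assert message_for("no_such_code") == "Something went wrong. Please try again."
```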

Well-maintained

Since end-to-end tests tend to be the quickest test to fall into disfavor with programmers, they need to be vigilantly maintained to be relevant, and as fast as possible. As soon as they become neglected and out-of-date, they are at risk of kicking off the downward spiral of maintenance costs.

Integration Tests

“Integration” is a catch-all term for everything that involves more functionality than a unit test, but is lower-level than a UI test. People use lots of different names to denote this layer, and there are a large number of test types that fall into this general category.

As a general rule, any logic in an integration test that can be fully tested in a unit test, should be moved to a unit test.

Classic Integration test

A classic case for an integration test would be a test that covers the interactions between 2-3 objects in the system. For example: You originally had a single object that was covered by a unit test, but due to changing requirements, it started to take on too much responsibility, and so you’ve factored the logic out into multiple smaller objects. While doing so you probably created unit tests for each object, but in order to cover all the logical cases of the pre-refactoring unit test, you write an integration test which tests for the correct interaction between your new smaller objects.

Another typical integration test scenario is testing saved database queries. These tests are important, but they are not unit tests because they are cutting across layers in the system (from the app to the database). This doesn’t make them bad tests, but it should be noted. If you have several objects that need to call into the database, a good practice is to stub/mock the database call in the individual objects’ unit tests, and have a separate group of integration tests to verify the actual database queries. This segregation will A: keep the test suite from becoming bogged down with slow database queries, and B: keep the testing of the query logic in one place. If a database schema changes in a bad way, we want just a couple tests to break at most, rather than a scenario in which 20 tests need to be updated because they were all relying on relevant queries.
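
One way to sketch that split in Python, using the standard library's `sqlite3` module with an in-memory database. The `users` table, the `active_user_names` query, and the `greeting_lines` function are hypothetical and exist only for this example.

```python
import sqlite3


def active_user_names(conn):
    """Hypothetical saved query: names of active users, alphabetically."""
    rows = conn.execute(
        "SELECT name FROM users WHERE active = 1 ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def greeting_lines(fetch_names):
    """Hypothetical object logic that reaches the query through a callable."""
    return ["Hello, " + name for name in fetch_names()]


def test_active_user_names_against_a_real_database():
    # Integration test: runs the real SQL against an in-memory SQLite database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, active INTEGER)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("zoe", 1), ("bob", 0), ("amy", 1)],
    )
    assert active_user_names(conn) == ["amy", "zoe"]
    conn.close()


def test_greeting_lines_with_stubbed_query():
    # Unit test: the database call is replaced by a stub returning fixed names.
    assert greeting_lines(lambda: ["amy"]) == ["Hello, amy"]
```

If the schema changes, only the first test should need attention; the unit test never touches the database.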

API

Tests of an internal API fall into this category. Similar to unit tests, API calls that are “queries” should only test the value that is returned, and API calls that are “commands” should verify that the correct message was sent.

Example: An API call to retrieve a list of users should only verify that the returned list has the correct contents in the correct format. Testing what database call is used makes our test too tightly coupled to that implementation (in the future we may decide to cache that query and not hit the database at all, or use a different query).

An API call that sets a user’s password, however, should verify that the proper state update took place, and that the database was updated with the correct information.
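
A rough sketch of both cases, with a hypothetical `UserApi` class backed by a plain dict standing in for the real storage. The names and the JSON response format are assumptions made for this example, not a real API.

```python
import json


class UserApi:
    """Hypothetical internal API backed by an in-memory dict for illustration."""

    def __init__(self, users):
        self._users = users  # maps user name -> password hash

    def list_users(self):
        # Query: returns a JSON list of user names.
        return json.dumps(sorted(self._users))

    def set_password(self, name, password_hash):
        # Command: changes stored state.
        self._users[name] = password_hash


def test_list_users_returns_names_as_json():
    api = UserApi({"zoe": "h1", "amy": "h2"})
    # Only the returned contents and format are checked, not how they were fetched.
    assert json.loads(api.list_users()) == ["amy", "zoe"]


def test_set_password_updates_the_stored_value():
    users = {"amy": "old-hash"}
    UserApi(users).set_password("amy", "new-hash")
    assert users == {"amy": "new-hash"}
```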

The APIs of an internal service are the boundaries of that service, so from one perspective API tests are the end-to-end tests of that service. Over-specified API tests can have similar drawbacks to a bad end-to-end test suite.

Subcutaneous Test

A subcutaneous test is like an end-to-end test, except it takes place just below the UI. These can be great for something that should be an end-to-end test, but where the effort of simulating UI interaction would be too much work.

Practices

Novice

Novices to testing will tend to write tests with gaps in the coverage of critical logic, and that are too tightly coupled to implementation. Because the coupling causes the tests to be brittle, and suboptimal choices of what code to test results in test suites not catching bugs, the experience will be that of spending lots of time on the test suite while getting little value out of it.

So the main goal of someone new to testing should be developing a feeling for what to test and how to test it.

Badly factored code cannot be easily tested. Good testing practices will not be helpful if they’re not accompanied by an understanding of Object Oriented design (assuming an OO context) and refactoring practices.

Sandi Metz’s books are excellent resources on OO design.

Martin Fowler wrote the definitive book on refactoring.

Intermediate

Once a programmer has a solid grasp of design principles under their belt, and can write valuable tests, then the next step is learning how to let your tests inform your application design.

Well written code is easy to change. Code that is easy to change follows the single responsibility principle (so changes are isolated), and is loosely coupled to the system around it.

Tests are the first code reuse. This means that testing application code in a unit test is effectively using the code in a new context. Using the code in a new context will make any coupling to the rest of the system immediately obvious.

Code that has lots of dependencies and is heavily coupled to the system around it will require lots of annoying setup in the tests. If the code is to be isolated, then all of the objects that the code depends on will need to be replaced with test doubles. In this way tests can give immediate design feedback.

Test Driven Development

TDD is the best way to allow information from your tests to guide your design. Writing the test first puts you into the mindset of thinking first and foremost about the interface of your object and the dependencies it requires.

The red-green-refactor cycle results in good test coverage of the logic in your code, and allows for an iterative development style driven by the goals set out by the test.

(See Kent Beck’s Test Driven Development)

Behavior Driven Development

BDD is a style of TDD in which the application logic is written from the outside in. This workflow allows for maximum design benefit to be derived from the tests.

Working from the outside in means that by the time you get to implementing the logic in your objects, you’ve already seen what their interface needs to be.

Programmers tend to work in the opposite direction, from the inside out. Working inside out inevitably leads to guessing about what will be needed from the object interface. This results in wasted time on code that ultimately won’t be used.

(For a detailed treatment of this style of development see Growing Object Oriented Software, Guided by Tests)

Advanced

  • Property-based testing

Reference

Testing Pyramid

Patterns

Anti-patterns

Inverted Pyramid

Books