Unit Tests Are Overrated: Rethinking Testing Strategies
As the industry pushes toward shorter release cycles, it has struggled to maintain confidence in the quality of the code being released, which highlights the fundamental need for a successful continuous testing strategy.
For as much as companies have focused on DevOps and continuous delivery, the sad truth is that release frequency has remained relatively flat, with only 10% of teams releasing at least daily. Several strategies are being deployed: shift left (e.g., more unit tests), shift right (e.g., testing in production), transitioning from manual to automated testing, and even removing distinct tester roles to hold developers accountable for the quality of the code they write.
These testing strategies are met with varying degrees of success, which can differ from team to team based on skill level, maturity, and culture. Testing often holds up releases because teams do not optimize for, and are often disincentivized from focusing on, the most important components of digital confidence.
Unit testing, the process of checking small pieces of code to deliver information early and often, has been viewed as a best practice, often supported by the argument that unit testing provides the most value during development because developers can catch errors quickly.
This idea has become so widely accepted that the term “unit testing” is now somewhat conflated with automated testing in general, losing part of its meaning and contributing to confusion.
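To keep the term concrete, here is a minimal sketch of a unit test in the narrow sense, written in TypeScript with Jest; the `add` function and file names are hypothetical:

```typescript
// math.ts: a hypothetical module under test
export function add(a: number, b: number): number {
  return a + b;
}

// math.test.ts: a unit test checks one small piece of code in isolation
import { add } from './math';

describe('add', () => {
  it('sums two positive numbers', () => {
    expect(add(2, 3)).toBe(5);
  });

  it('handles negative inputs', () => {
    expect(add(-2, 3)).toBe(1);
  });
});
```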
Testing Is Not a Monolith
First, it’s essential to distinguish the types of tests being written, because they are easy to mix up and can lead to confusion. This article focuses on Functional Testing, which verifies that features work correctly; this can be distinguished from Accessibility Testing, Performance Testing, Security Testing, and Load Testing.
When thinking of functional tests, there are two primary uses: Acceptance Tests, which verify that a new feature or bug fix does what it was intended to do, and Regression Tests, which are run regularly to verify that the new code didn’t break the old code.
Developers traditionally focus on acceptance testing, which is typically composed entirely of unit tests that verify the lowest level of code, such as all the possible inputs to each method in a class. In Agile circles this is often done with test-driven development (TDD), where developers first write the tests, then implement the feature so that those tests pass. TDD itself is also overrated, as writing testable code is much more important than the actual tests.
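For concreteness, the red-green cycle looks something like this minimal sketch in TypeScript with Jest; the `slugify` function and file names are hypothetical:

```typescript
// slugify.test.ts, Step 1 (red): the test is written first and fails,
// because slugify does not exist yet.
import { slugify } from './slugify';

it('lowercases a title and hyphenates its spaces', () => {
  expect(slugify('Unit Tests Are Overrated')).toBe('unit-tests-are-overrated');
});

// slugify.ts, Step 2 (green): just enough implementation
// to make the failing test pass.
export function slugify(title: string): string {
  return title.trim().toLowerCase().replace(/\s+/g, '-');
}
```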
Acceptance tests increasingly include Component Tests, which isolate a set of code that shares a common function and ensure it works properly, and sometimes include Integration Tests, which verify that various components interact with each other properly. In teams or departments that do not have separate QA or testing teams, acceptance tests may include DOM to Database (D2D) Tests, which are more often referred to by less precise terms like End to End (E2E) Tests or User Interface (UI) Tests. D2D tests evaluate the full stack, from the interactions the end-user has with the UI to the API to the Service Layer to the Database and back.
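For contrast, a D2D test drives the deployed system through a real browser. Here is a minimal sketch using Playwright, where the URL, page elements, and expected text are all assumptions invented for the example:

```typescript
import { test, expect } from '@playwright/test';

// A D2D test drives the real UI, which in turn exercises the real API,
// Service Layer and Database; nothing in the stack is mocked.
test('a placed order appears in order history', async ({ page }) => {
  await page.goto('https://staging.example.com/shop');
  await page.getByRole('button', { name: 'Add to cart' }).click();
  await page.getByRole('link', { name: 'Checkout' }).click();
  await page.getByRole('button', { name: 'Place order' }).click();

  // This assertion only passes if the order round-tripped
  // through the full stack and back into the UI.
  await page.goto('https://staging.example.com/orders');
  await expect(page.getByText('Order #')).toBeVisible();
});
```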
Problems occur when acceptance tests are used as regression tests. Reusing them this way is the actual promise of Behavior Driven Development (BDD), a process often endorsed by Agile consultants to encourage communication between stakeholders, in which everyone agrees on the tests up front and then keeps using them, though in practice it can add more layers of complexity to the process.
The Testing Pyramid
The Testing Pyramid is the typical guidance for how to approach functional regression testing. The pyramid puts unit tests at the bottom, integration tests in the middle and D2D tests at the top. It reasons that because unit tests are faster and more reliable, they should be the primary focus, whereas D2D tests, being significantly slower and much more flaky, should receive the least focus, with integration tests falling somewhere in between.
One problem with this approach is that in organizations that draw a more distinct separation between the tester and developer roles, the middle layer effectively gets ignored. Developers focus on unit tests because they are evaluated by unit test coverage percentage metrics, and those tests aren’t sufficient. Testers focus on inputs to the UI because this is how testing has traditionally been done; such tests are easy to do poorly, and everyone who ends up having to write them struggles with them.
Another problem is that the testing pyramid is not based on the right metrics. Anyone who has spent time testing websites that are regularly updated knows that the expensive part isn’t how long it takes to create a test, or even the time it takes to execute it; it’s everything that goes into maintaining the tests.
This is why getting artificial intelligence projects to create basic tests isn’t the impressive accomplishment people seem to think it is. Speed isn’t the right metric and the comparison here focuses on the wrong costs. The metric needs to be the amount of confidence provided for the amount of resources required to maintain the test.
Unit tests fare far worse by this metric than most people realize. The first problem is that they often don’t provide useful information about the actual state of the system under review. When unit tests are written as acceptance tests, they are often intricately coupled to the specific implementation. Such tests fail only when the implementation changes, not when changes break the system (e.g., a test that verifies the value of a class constant). Using acceptance tests as regression tests must be done intentionally and thoughtfully, deleting everything that does not provide useful information about the system’s behavior.
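The class-constant case might look like this hypothetical sketch; the test merely restates the implementation, so it fails only when someone edits the constant, never when behavior actually regresses:

```typescript
// retry-policy.ts
export class RetryPolicy {
  static readonly MAX_RETRIES = 3;
}

// retry-policy.test.ts: this test is coupled to the implementation.
// It fails when someone edits the constant (a deliberate change),
// but never when retry behavior actually breaks.
import { RetryPolicy } from './retry-policy';

it('has the expected retry limit', () => {
  expect(RetryPolicy.MAX_RETRIES).toBe(3);
});
```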
Another major problem with unit tests is that to test the inputs of one method, you often need to mock out the responses from other methods. When you do this, you are no longer testing the system you have; you are testing a system that you assumed you had in the past. The system can break and a unit test will not fail, because the test baked in an assumption about an input that the real-world system no longer supplies. The best place to do mocking is at the integration layer, where you can mock out both sides of an interface with separate sets of tests and use contract tests to ensure that these mocks properly represent the actual state of the system from both sides. Additionally, mocking out services owned by other companies is good practice, especially if they have a stable API.
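Here is a hypothetical sketch of that trap using Jest module mocks; `getUser`, `formatGreeting`, and the module paths are invented for the example:

```typescript
// greeting.test.ts
import { getUser } from './user-service';
import { formatGreeting } from './greeting';

// Replace the real user-service module with an automatic mock.
jest.mock('./user-service');

it('greets the user by first name', async () => {
  // The mock bakes in yesterday's assumption about the response shape.
  (getUser as jest.Mock).mockResolvedValue({ firstName: 'Ada' });

  // If the real service now returns { name: { first: 'Ada' } },
  // production breaks while this test keeps passing.
  await expect(formatGreeting(1)).resolves.toBe('Hello, Ada!');
});
```

Consumer-driven contract testing tools such as Pact are one way to keep integration-layer mocks like this honest on both sides of the interface.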
For these reasons, while unit tests are indeed fast to execute, they are not necessarily any easier to maintain. For every product change, developers need to diligently review every test with mocks in the suite to ensure it still applies. Then, any tests that break as a result of implementation changes need to be updated, even though the replacements are often just as useless for detecting future regressions.
Test for Confidence Over Ease
While unit testing is seen by many as a cornerstone of software development best practices, it is not a panacea for all testing needs. As the industry evolves and aims for shorter release cycles, it must also evolve its approach to testing. A successful and continuous testing strategy demands a careful balance between various types of tests; instead of focusing solely on execution speed, teams should prioritize the amount of confidence each test provides relative to the time and effort invested in keeping it accurate.
By embracing a more holistic testing approach, teams can deliver better software faster. As with so much in life, the key is using the right tool for the job, not just the tool that is most convenient at the moment. When it comes to testing, smart and strategic beats brute force every time.